End-to-End Egospheric Spatial Memory
ICLR 2021
1Dyson Robotics Lab, 2Department of Computing, Imperial College London
Overview
Egospheric Spatial Memory (ESM) encodes memory in an ego-sphere around the agent, enabling expressive 3D representations. ESM can be trained end-to-end via either imitation or reinforcement learning, and improves both training efficiency and final performance relative to other memory baselines on both drone and manipulator visuomotor control tasks.
ESM is a parameter-free module that relies on forward-warp reprojections to update the memory. The egospheric memory at time t-1 is combined with the new observations at time t to produce the new egospheric memory at time t.
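The update from time t-1 to time t can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a minimal illustration under several assumptions of ours: the ego-sphere is stored as an equirectangular (H, W) depth image plus an (H, W, C) feature image, unobserved pixels hold infinite depth, the relative pose (R, t) maps points from the previous agent frame into the current one, and collisions within a pixel are resolved by keeping the nearest point. The function names (pixel_angles, sphere_to_points, points_to_sphere, esm_update) are hypothetical.

```python
import numpy as np

def pixel_angles(h, w):
    """Polar angle in (0, pi) and azimuth in (-pi, pi) at each pixel centre."""
    theta = (np.arange(h) + 0.5) / h * np.pi
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    return np.meshgrid(theta, phi, indexing="ij")

def sphere_to_points(depth):
    """Back-project an (H, W) egospheric depth image to (H*W, 3) Cartesian points."""
    theta, phi = pixel_angles(*depth.shape)
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)
    return (depth[..., None] * dirs).reshape(-1, 3)

def points_to_sphere(points, feats, h, w):
    """Forward-warp points into an (H, W) ego-sphere, keeping the nearest point per pixel."""
    r = np.linalg.norm(points, axis=1)
    keep = r > 1e-6  # drop points at the sphere centre, whose direction is undefined
    points, feats, r = points[keep], feats[keep], r[keep]
    theta = np.arccos(np.clip(points[:, 2] / r, -1.0, 1.0))
    phi = np.arctan2(points[:, 1], points[:, 0])
    row = np.clip((theta / np.pi * h).astype(int), 0, h - 1)
    col = np.clip(((phi + np.pi) / (2.0 * np.pi) * w).astype(int), 0, w - 1)
    flat = row * w + col
    depth = np.full(h * w, np.inf)
    np.minimum.at(depth, flat, r)   # per-pixel minimum radius (a simple depth buffer)
    nearest = r == depth[flat]      # points that won their pixel
    out = np.zeros((h * w, feats.shape[1]))
    out[flat[nearest]] = feats[nearest]
    return depth.reshape(h, w), out.reshape(h, w, -1)

def esm_update(mem_depth, mem_feats, R, t, obs_points, obs_feats):
    """Combine the memory at t-1 with new observations to give the memory at t.

    Assumed convention: p_current = R @ p_previous + t. obs_points is an (N, 3)
    point cloud already expressed in the current agent frame, with (N, C) features.
    """
    h, w = mem_depth.shape
    seen = np.isfinite(mem_depth).ravel()  # skip pixels never observed
    prev = sphere_to_points(np.where(np.isfinite(mem_depth), mem_depth, 0.0))[seen]
    prev = prev @ R.T + t                  # re-express old memory in the new frame
    points = np.concatenate([prev, obs_points])
    feats = np.concatenate([mem_feats.reshape(h * w, -1)[seen], obs_feats])
    return points_to_sphere(points, feats, h, w)
```

The sketch has no learnable parameters, matching the description of ESM as parameter-free: the only operations are coordinate conversions, a rigid transform, and a depth-buffered scatter. Any smoothing or hole filling a practical implementation might need is left out.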
Off-the-Shelf Mapping
Assuming access to a stream of depth and color images, and access to camera pose estimates, the ESM module can be used for off-the-shelf real-time egocentric mapping, with color values projected into memory.
Neural Network Integration
The real strength of ESM arises when training end-to-end as part of a wider neural network. The ESM module can be combined with both pre-module convolutions and post-module convolutions to solve a variety of downstream tasks. The pre-module convolutions enable learnt features, optimized for the downstream task, to be stored in the module. The post-module convolutions can then use this stored representation to execute the task. We refer to networks with both pre- and post-module convolutions as Egospheric Spatial Memory Networks (ESMN). For some tasks, the post-module convolutions are sufficient, with color values projected directly into the memory. We refer to these networks as ESMN-RGB.
We compare against other, less structured memory baselines, such as long short-term memory (LSTM) and neural Turing machines (NTM). The baseline methods are given access to all the same information, including ground-truth poses.
Image to Action Learning
We test ESM in a variety of image-to-action reacher tasks. We test for 6DOF control of both drones and robot manipulators, using either onboard or freely moving cameras, with networks conditioned on target shape or target color, trained using either imitation learning or reinforcement learning. In all cases, we find that ESMN and ESMN-RGB outperform less structured memory baselines, such as long short-term memory and neural Turing machines.
The explicit geometry also enables seamless integration with other non-learnt control strategies, such as local obstacle avoidance. The egocentric geometry leads to more robust avoidance than is possible using individual depth frames.
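To make the obstacle-avoidance point concrete, here is a hedged sketch of one possible non-learnt strategy operating on the egospheric depth. The page does not specify which avoidance method was used, so the repulsion rule below, the function name avoidance_velocity, and the parameters safe_dist and gain are all illustrative assumptions. The ego-sphere is again assumed to be an equirectangular (H, W) depth image with infinite depth at unobserved pixels.

```python
import numpy as np

def avoidance_velocity(sphere_depth, safe_dist=1.0, gain=1.0):
    """Return a 3D velocity pushing away from the nearest stored surface.

    Illustrative repulsion rule (an assumption, not the authors' method):
    zero if nothing lies within safe_dist, otherwise a push along the
    direction opposite the nearest point, scaled by the penetration depth.
    """
    h, w = sphere_depth.shape
    row, col = np.unravel_index(np.argmin(sphere_depth), (h, w))
    nearest = sphere_depth[row, col]
    if not nearest < safe_dist:
        return np.zeros(3)
    # Unit direction of the pixel centre in the agent frame.
    theta = (row + 0.5) / h * np.pi
    phi = (col + 0.5) / w * 2.0 * np.pi - np.pi
    direction = np.array([np.sin(theta) * np.cos(phi),
                          np.sin(theta) * np.sin(phi),
                          np.cos(theta)])
    return -gain * (safe_dist - nearest) * direction
```

Because the depth covers the full sphere, an obstacle that has left the current camera frustum (for example, one now behind the agent) still contributes a repulsive term, which a rule applied to a single depth frame could not provide. The returned vector would typically be added to the velocity proposed by the learnt policy.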