CodeSLAM


PyTorch implementation of CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM.


Problems it tries to tackle/solve

  • Representation of geometry in real 3D perception systems.
    • Dense representations, possibly augmented with semantic labels, are high dimensional and unsuitable for probabilistic inference.
    • Sparse representations avoid these problems but capture only partial scene information.

The new approach/solution

  • New compact but dense representation of scene geometry, conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters.
  • Each keyframe can produce a depth map, but the code can be optimised jointly with pose variables and with the codes of overlapping keyframes, for global consistency.
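
To make the idea of an optimisable code concrete, here is a minimal sketch of treating a keyframe's code as a free variable and refining it by gradient descent through a frozen decoder. It is an illustration under loud assumptions, not this repository's implementation: `ToyDecoder`, `CODE_DIM`, the tiny map size, and the synthetic depth target are all hypothetical. In the full system the code would be optimised jointly with camera poses and the codes of overlapping keyframes, rather than against a known depth target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

CODE_DIM = 32   # hypothetical code size, chosen for illustration
H, W = 8, 8     # tiny depth map so the example runs instantly


class ToyDecoder(nn.Module):
    """Hypothetical stand-in for a trained decoder: (image, code) -> depth."""

    def __init__(self):
        super().__init__()
        self.image_branch = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.code_branch = nn.Linear(CODE_DIM, H * W)

    def forward(self, image, code):
        local_detail = self.image_branch(image)                   # (B, 1, H, W)
        global_shape = self.code_branch(code).view(-1, 1, H, W)   # (B, 1, H, W)
        return local_detail + global_shape


# Freeze the decoder: only the code is optimised.
decoder = ToyDecoder().eval()
for param in decoder.parameters():
    param.requires_grad_(False)

image = torch.rand(1, 1, H, W)
# Synthetic target: the depth decoded from a hidden "true" code.
target_depth = decoder(image, torch.randn(1, CODE_DIM))

# Start from the zero code and refine it.
code = torch.zeros(1, CODE_DIM, requires_grad=True)
optimizer = torch.optim.Adam([code], lr=0.1)

initial_loss = torch.mean((decoder(image, code) - target_depth) ** 2).item()
for _ in range(200):
    optimizer.zero_grad()
    loss = torch.mean((decoder(image, code) - target_depth) ** 2)
    loss.backward()
    optimizer.step()
final_loss = loss.item()

print(f"depth error: {initial_loss:.4f} -> {final_loss:.4f}")
```

Because the code has only a few dozen entries, the optimisation problem stays small even though the decoded depth map is dense.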


  • As uncertainty propagation quickly becomes intractable for a large number of degrees of freedom, approaches to SLAM are split into two categories:
    • sparse SLAM, which represents geometry by a sparse set of features
    • dense SLAM, which attempts to retrieve a more complete description of the environment.
  • The geometry of natural scenes exhibits a high degree of order, so we may not need a large number of parameters to represent it.
  • Besides that, a scene could be decomposed into a set of semantic objects (e.g. a chair) together with some internal parameters (e.g. size of the chair, number of legs) and a pose. Other more general scene elements, which exhibit simple regularity, can be recognised and parametrised within SLAM systems.
  • A straightforward autoencoder (AE) might oversimplify the reconstruction of natural scenes; the novelty is to condition the training on intensity images.
  • A scene map consists of a set of selected and estimated historical camera poses together with the corresponding captured images and supplementary local information such as depth estimates. The intensity images are usually required for additional tasks.
  • The depth map estimate becomes a function of the corresponding intensity image and an unknown compact representation (referred to as the code).
  • We can think of the image as providing local details and the code as supplying more global shape parameters. This can be seen as a step towards enabling optimisation in a general semantic space.
  • The 2 key contributions of this paper are:
    • The derivation of a compact and optimisable representation of dense geometry by conditioning a depth autoencoder on intensity images.
    • The implementation of the first real-time targeted monocular system that achieves such a tight joint optimisation of motion and dense geometry.
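
The first contribution, a depth autoencoder conditioned on the intensity image, can be sketched roughly as follows. This is a deliberately simplified, non-authoritative illustration: the class name `ConditionalDepthAutoencoder`, the layer choices, and the sizes are hypothetical and are not taken from this repository or the paper's architecture. It only shows the data flow: image features feed both the encoder and the decoder, so the code only has to carry what the image cannot explain.

```python
import torch
import torch.nn as nn


class ConditionalDepthAutoencoder(nn.Module):
    """Hypothetical sketch: depth autoencoder conditioned on image features."""

    def __init__(self, code_dim=32, size=16, feat_channels=4):
        super().__init__()
        self.size = size
        # Features extracted from the intensity image (the conditioning signal).
        self.image_features = nn.Sequential(
            nn.Conv2d(1, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Encoder sees the depth map together with the image features.
        self.encoder = nn.Linear((1 + feat_channels) * size * size, code_dim)
        # Decoder expands the code to a map and fuses it with image features.
        self.code_to_map = nn.Linear(code_dim, size * size)
        self.fuse = nn.Conv2d(1 + feat_channels, 1, kernel_size=3, padding=1)

    def encode(self, depth, image):
        feats = self.image_features(image)
        stacked = torch.cat([depth, feats], dim=1).flatten(start_dim=1)
        return self.encoder(stacked)

    def decode(self, code, image):
        feats = self.image_features(image)
        code_map = self.code_to_map(code).view(-1, 1, self.size, self.size)
        return self.fuse(torch.cat([code_map, feats], dim=1))

    def forward(self, depth, image):
        code = self.encode(depth, image)
        return self.decode(code, image), code


model = ConditionalDepthAutoencoder()
depth = torch.rand(2, 1, 16, 16)   # batch of toy depth maps
image = torch.rand(2, 1, 16, 16)   # matching intensity images

reconstruction, code = model(depth, image)
# At test time no depth is available, so decoding starts from some initial
# code (here all zeros) and relies on the image alone.
zero_code_depth = model.decode(torch.zeros(2, 32), image)
```

Note that `decode` needs only the image and a code, which is what lets the code be treated as an unknown to optimise when no depth measurement exists.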


  • Generate the Python module for the protobuf definitions: protoc --python_out=./ scenenet.proto



  • Python 3.4+
  • PyTorch 1.0+
  • Torchvision 0.4.0+