
Is your method end-to-end?

Open phongnhhn92 opened this issue 4 years ago • 4 comments

Hello, I have read your paper! Thanks for uploading the code. However, I would like to ask whether your method can be trained end-to-end. As I understand it, the Depth module builds a cost volume around the keyframe and then uses a 3D CNN to predict the depth of that keyframe. In the Motion module, images and depths are required as input to predict the relative poses. If you have N = 5 input images, does that mean you have to run your Depth module N times to get all N depth maps as input to the Motion module?
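The cost-volume step described above can be sketched roughly as follows. This is a hypothetical illustration, not DeepV2D's actual code: `build_cost_volume`, the shapes, and the assumption that a neighbor frame's features have already been warped to D depth hypotheses of the keyframe are all mine.

```python
import numpy as np

def build_cost_volume(key_feat, warped_feats):
    """Hypothetical cost-volume sketch.
    key_feat: (H, W, C) keyframe features.
    warped_feats: (D, H, W, C) neighbor-frame features backwards-warped
    to D depth hypotheses of the keyframe.
    Returns a (D, H, W, 2C) volume for a 3D CNN to filter."""
    D = warped_feats.shape[0]
    # Repeat the keyframe features along the depth-hypothesis axis
    key = np.broadcast_to(key_feat, (D,) + key_feat.shape)
    # Concatenate keyframe and warped features per depth plane
    return np.concatenate([key, warped_feats], axis=-1)
```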

phongnhhn92 avatar Jun 22 '20 07:06 phongnhhn92

Hi, we unroll a single step during training (1 motion update and 1 depth update). This is end-to-end in the sense that we can backpropagate gradients from the depth output back through the motion module.

Due to memory constraints, we only compute the depth for a single frame in the video during training. However, having the depth for a single frame is sufficient as input to the motion module; this corresponds to "Keyframe Pose Optimization" (Sec. 3.2) in our paper. Our network is trained in this setting.

At inference time, you can run DeepV2D in "global" mode (--mode=global), where the depth for all frames is computed as input to the motion module. This is done in a single forward pass by automatically batching the frames.
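The "single forward pass by batching" idea can be sketched like this. `depth_net` and `batched_depths` are stand-ins for illustration, not DeepV2D's actual API:

```python
import numpy as np

def batched_depths(frames, depth_net):
    """Hypothetical sketch: instead of running the depth network once
    per frame, stack the N frames along the batch dimension and do a
    single forward pass."""
    batch = np.stack(frames, axis=0)   # (N, H, W, 3)
    depths = depth_net(batch)          # one forward pass -> (N, H, W)
    return list(depths)
```

In local mode this batch would hold just the keyframe; in global mode it holds every frame in the video.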

zachteed avatar Jun 23 '20 15:06 zachteed

[image: KeyframeGraph]

This picture might help. During training we operate in local mode, where only the depth for a single keyframe is estimated; this is sufficient to estimate the pose of all frames. During inference, we can operate in global mode, which estimates the depth for all frames. This introduces redundant constraints, which gives some improvement in performance. Each edge in the graph corresponds to estimating the optical flow between a pair of frames. Keyframes can have both outgoing and incoming edges, while one-way frames (without depth) can have only incoming edges.

This graph is used to define the objective function in Eq. 5, where pairs (i, j) ∈ C correspond to edges in the graph.
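A minimal sketch of how the edge set C could be built for the two modes, assuming edges originate at frames that have depth (the function name and representation are my illustration, not the repo's code):

```python
def build_edges(num_frames, keyframes):
    """Hypothetical sketch of the keyframe graph's edge set C.
    An edge (i, j) means flow is estimated from frame i (which has
    depth) to frame j, so only keyframes have outgoing edges."""
    return [(i, j) for i in keyframes
                   for j in range(num_frames) if i != j]
```

In local mode with one keyframe, `build_edges(5, [0])` gives the 4 outgoing edges of frame 0; in global mode, `build_edges(5, range(5))` gives all 20 ordered pairs, i.e. the redundant constraints mentioned above.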

zachteed avatar Jun 23 '20 15:06 zachteed

Hi, thanks for your reply! So do you mean that during training the Depth module will only predict the depth map of the keyframe and use it to concatenate with images at different timesteps in the Motion module? I am sorry if my questions are a bit too much.

phongnhhn92 avatar Jun 24 '20 10:06 phongnhhn92

Hi, yes, during training we only predict the depth for a keyframe (taken to be the first frame in the sequence). However, with more GPU memory or a smaller batch size, it would certainly be possible with the code to use 2 or more keyframes.

But we don't concatenate depth with the images. Instead, the motion module estimates the optical flow between the keyframe and each of the other frames. The optical flow and depth are then used as input to a least-squares optimization layer, which uses the flow and depth to solve for the pose update.
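A minimal sketch of such a least-squares layer, assuming flow residuals r, a Jacobian J of the reprojected flow with respect to the 6-DoF pose, and per-pixel confidence weights w. The damping and shapes are my assumptions, not the paper's exact layer:

```python
import numpy as np

def lsq_pose_update(J, r, w, lam=1e-4):
    """One damped Gauss-Newton step: solve
    (J^T W J + lam*I) dx = -J^T W r for the 6-DoF pose update dx.
    J: (M, 6), r: (M,), w: (M,) confidence weights."""
    JTW = J.T * w                        # (6, M) weighted Jacobian J^T W
    H = JTW @ J + lam * np.eye(6)        # damped normal-equation matrix
    return np.linalg.solve(H, -JTW @ r)  # 6-DoF update dx
```

Because the solve is differentiable, gradients can flow from the pose update back into whatever network produced the flow and weights.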

zachteed avatar Jun 25 '20 04:06 zachteed