
Training did not converge

HTLife opened this issue 6 years ago • 33 comments

Recently, I tried to implement VINet [1] and open-sourced it on GitHub: HTLife/VINet

I have already completed the whole network structure, but the network can't converge properly during training. How can I fix this problem?

Possible problems & solutions:

  1. The dataset is too challenging: I'm using the EuRoC MAV dataset, which is more challenging than the KITTI VO dataset used by DeepVO and VINet (the KITTI vehicle images do not shake up and down), so the network cannot learn the camera motion correctly.

  2. Loss function: an L1 loss is used, summing the two losses described in [1] (see HTLife/VINet main.py, https://github.com/HTLife/VINet/blob/master/main.py#L210). I'm not very confident that I understand the loss design in [1], so the implementation may be wrong. Related code

  3. Other hyperparameter problems


HTLife avatar Apr 11 '18 13:04 HTLife

How long did you train on the EuRoC dataset? I first tried VINet and found it harder to converge, then I tried DeepVO; it also can't converge. The loss function I use is L = pose error + 100 * angle error.
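Concretely, a minimal PyTorch sketch of that loss (the column split and the weight of 100 are just how I set it up; the names are illustrative):

```python
import torch
import torch.nn.functional as F

def weighted_pose_loss(pred, target, angle_weight=100.0):
    """L = pose error + angle_weight * angle error.

    pred, target: (batch, 6) tensors, columns 0:3 = translation, 3:6 = rotation.
    """
    trans_err = F.l1_loss(pred[:, :3], target[:, :3])
    rot_err = F.l1_loss(pred[:, 3:], target[:, 3:])
    return trans_err + angle_weight * rot_err
```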

lrxjason avatar Apr 17 '18 05:04 lrxjason

@lrxjason The current version 60d21c7 won't converge. Which DeepVO GitHub repo did you use?

This implementation of VINet is still being tuned. The current version 60d21c7 still has some parts that don't follow the original paper.
Now I'm trying to convert the (x y z quaternion) format to se(3) and change the loss function, to see if this works.

(figure: network structure detail)

HTLife avatar Apr 17 '18 13:04 HTLife

@HTLife I used this https://github.com/sladebot/deepvo/tree/master/src , but I changed the training batch part.

I tried both FlowNetS and a plain CNN. Neither of them converges.

For the angle, I tried pitch/roll/yaw and qw qx qy qz, but I get the same non-converging result. And for 4000 training images, I need almost 1 hour for 1 epoch.

lrxjason avatar Apr 18 '18 03:04 lrxjason

As the VINet paper mentions, they found it hard to converge when training only on the frame-to-frame pose change. They also take the accumulated global pose (the pose relative to the starting point) into account to help the network converge. I'll try this idea tomorrow and verify whether it works.

@lrxjason Do you have any other suggestions to help this network converge?

HTLife avatar Apr 18 '18 12:04 HTLife

I tried using just the CNN part (ignoring the LSTM part). It also doesn't converge. I will try using PoseNet to train on the dataset tomorrow.

I'm confused about the global pose. If the current camera position is far from the starting point, which means there is no overlapping area, how could the global pose help?

Because of hardware limits, I can only set the timestep to 10, which means I can only estimate the pose over 10 frames; otherwise the GPU runs out of memory. @HTLife

lrxjason avatar Apr 24 '18 23:04 lrxjason

About global pose

I think they are not using the global pose directly. Instead, they use the "difference" between the global pose and the accumulated relative poses. This loss design might reduce the "drift" in the estimate.

On page 3999: (screenshot of the relevant paragraph)
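Roughly, the loss design I have in mind would look like the following sketch (my own reading of the paper, not their exact formulation; names are illustrative):

```python
import torch.nn.functional as F

def vinet_style_loss(pred_rel, gt_rel, pred_global, gt_global, alpha=1.0, beta=1.0):
    """Sum of a frame-to-frame (relative) pose loss and a loss on the
    accumulated global pose, both as L1 on 6-D se(3)-style vectors.

    pred_rel, gt_rel:       (seq, 6) relative poses between consecutive frames
    pred_global, gt_global: (seq, 6) poses relative to the sequence start
    """
    rel_loss = F.l1_loss(pred_rel, gt_rel)
    global_loss = F.l1_loss(pred_global, gt_global)
    return alpha * rel_loss + beta * global_loss
```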

Code update

I just uploaded a version with a complete SE3 implementation (SHA: 4f14be7bd5a5163dc0b9a41e1ffa9473f5817758).

One implementation detail: because of the xyz-quaternion feedback design, training can currently only do SGD with batch size 1.

I'm now training with this new implementation and adjusting the loss details.

By the way, @lrxjason did you make any progress with PoseNet?

HTLife avatar Apr 27 '18 17:04 HTLife

The "SE3 composition layer" mentioned in VINet might be related to gvnn. Since a PyTorch implementation is not available (gvnn is implemented in Torch), I replaced the "SE3 composition layer" with SE3 matrix multiplication.

VINet does not give the details of the SE3 composition layer, but a related description can be found in [1] (published by the same lab).

I do not understand the difference between training directly on se(3) versus on SE(3), or how that would affect the convergence of training.
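To make explicit what I mean by "SE3 matrix multiplication", here is a rough sketch of accumulating relative poses as 4x4 homogeneous matrices (an illustration only, not the gvnn composition layer; the helper names are mine):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def to_matrix(x, y, z, qw, qx, qy, qz):
    """Build a 4x4 homogeneous SE(3) matrix from a translation + quaternion."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()  # SciPy order: (x, y, z, w)
    T[:3, 3] = [x, y, z]
    return T

def accumulate(relative_poses):
    """Compose relative SE(3) transforms into global poses by matrix multiplication."""
    T_global = np.eye(4)
    globals_ = []
    for pose in relative_poses:          # each pose: (x, y, z, qw, qx, qy, qz)
        T_global = T_global @ to_matrix(*pose)
        globals_.append(T_global.copy())
    return globals_
```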

[1] S. Wang, R. Clark, H. Wen, and N. Trigoni, "End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks," Int. J. Rob. Res., 2017.

HTLife avatar Apr 29 '18 23:04 HTLife

@HTLife I have looked at the paper and your code again. It seems like the short-batch trajectory structure is important, as described in Figure 4 of the original paper. SE3 accumulates error over time, which hinders convergence. It seems like they divide the trajectory into short fragments with different initial poses. Also, to keep continuity, they pass the RNN hidden state of the previous batch to the next batch. Maybe that could be the cause of the problem. What do you think?

By the way, how are you debugging the code inside Docker? I started with Python and Docker a few days ago. Any advice would be appreciated.

copark86 avatar May 03 '18 23:05 copark86

@copark86 I hadn't noticed the passing of the RNN hidden state to the next batch! That might be important. Also, I'll find some time to replace the "matrix-multiplication implementation of the accumulated SE3" with the "SE3 composition layer" and see if this works.
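For reference, a minimal sketch of carrying the LSTM hidden state across batches in PyTorch (detached at the fragment boundary; sizes and names are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1024, hidden_size=1000, num_layers=2, batch_first=True)
hidden = None  # the first fragment starts from a zero state

# Pretend the trajectory is split into 4 short fragments of 10 frames each.
fragments = [torch.randn(1, 10, 1024) for _ in range(4)]

for features in fragments:
    out, hidden = lstm(features, hidden)      # hidden state carried over from the previous fragment
    # Detach so back-propagation stops at the fragment boundary
    # while the state values still flow into the next fragment.
    hidden = tuple(h.detach() for h in hidden)
```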

HTLife avatar May 06 '18 03:05 HTLife

@HTLife I just found that read_R6TrajFile reads the relative pose as a 1x6 vector, whereas the saved relative-pose file is actually written in quaternion format. Is sampleGT_relative.py the right code to generate the relative pose?

copark86 avatar May 08 '18 03:05 copark86

@HTLife Regarding the input of the IMU RNN, I guess it should include every IMU measurement between two image frames, but it seems like your code loads only 5 nearby IMU samples. Please correct me if I am wrong.

copark86 avatar May 08 '18 04:05 copark86

About the magic number of reading IMU data

@copark86 Sorry for that magic number. The EuRoC MAV dataset synchronizes the camera and IMU timestamps; there is one IMU record between two images.

(Left: image timestamps, Right: IMU data.csv) (screenshot)

What I did is feed the RNN with a longer IMU sequence (arbitrary length 5) rather than 3 (image1 time, midpoint, image2 time).

HTLife avatar May 08 '18 06:05 HTLife

About read_R6TrajFile dimension

@copark86
The corresponding part is shown in the following figure: (screenshot of the relevant code)

  1. I calculate the relative pose (x y z qw qx qy qz) from the absolute pose (x y z qw qx qy qz).
  2. se(3) can be represented with 6 values. Therefore, (x y z qw qx qy qz) should be converted to R^6, and the output dimension of VINet should also be 6 rather than 7.

I found I was reading the wrong file. It should be:

self.trajectory_relative = self.read_R6TrajFile('/vicon0/sampled_relative_R6.csv')
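For step 1, this is roughly how the relative pose can be computed from two absolute poses (a sketch of the math only, not the repo's sampleGT_relative.py):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_pose(abs_prev, abs_curr):
    """Relative pose of the current frame expressed in the previous frame.

    Each pose is (x, y, z, qw, qx, qy, qz) in the world/vicon frame.
    """
    t_prev, t_curr = np.array(abs_prev[:3]), np.array(abs_curr[:3])
    # SciPy quaternion order is (qx, qy, qz, qw).
    R_prev = Rotation.from_quat([abs_prev[4], abs_prev[5], abs_prev[6], abs_prev[3]])
    R_curr = Rotation.from_quat([abs_curr[4], abs_curr[5], abs_curr[6], abs_curr[3]])

    R_rel = R_prev.inv() * R_curr                  # rotation from previous to current frame
    t_rel = R_prev.inv().apply(t_curr - t_prev)    # translation expressed in the previous frame

    qx, qy, qz, qw = R_rel.as_quat()
    return (*t_rel, qw, qx, qy, qz)
```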

HTLife avatar May 08 '18 06:05 HTLife

@HTLife Regarding the IMU data: the camera is 20 Hz and the IMU is 200 Hz. The training dataset has 2280 images and 22801 IMU measurements, which means there should be 10 IMU measurements between two consecutive images. If there are only two IMU samples between two images as you said, does that mean you are pre-integrating the IMU data?

Reading sampled_relative_R6 instead of sampled_relative makes perfect sense. Thanks for the explanation.

copark86 avatar May 08 '18 07:05 copark86

@copark86 Let me restate it.

IMU => no sampling

We should use the raw IMU data (x, y, z, wx, wy, wz). The IMU sequence could span 8 images (an arbitrary number). We have 70 IMU records between 8 images, so the magic number should change from 5 to 70.

(I assume one red box corresponds to one image) (screenshot)

VICON => sampled to camera rate

The ground truth is VICON, which is 40 Hz (twice the frequency of the 20 FPS stereo camera)
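To illustrate the "no sampling" idea, a minimal sketch of collecting every raw IMU row whose timestamp falls inside an image window (illustrative only, not the repo's data loader):

```python
import numpy as np

def imu_between(imu_timestamps, imu_values, t_start, t_end):
    """Return every raw IMU sample with t_start <= t < t_end.

    imu_timestamps: (N,) array of nanosecond timestamps from data.csv
    imu_values:     (N, 6) array of (wx, wy, wz, ax, ay, az) rows
    t_start, t_end: timestamps of two (not necessarily consecutive) images
    """
    mask = (imu_timestamps >= t_start) & (imu_timestamps < t_end)
    return imu_values[mask]          # e.g. ~70 rows for a window spanning 8 frames
```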

HTLife avatar May 08 '18 07:05 HTLife

@copark86 The VICON R6 conversion code has been updated. e8d72ea

See scripts/convertSampledTrajToR6.py

The original R6 file was incorrect: it converted sampled_relative from x y z qw qx qy qz into x y z so(3). scripts/convertSampledTrajToR6.py converts x y z qw qx qy qz to an se(3) R^6 vector.
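For reference, the conversion is essentially the SE(3) logarithm map; a minimal sketch (my own illustration of the math, not the repo's convertSampledTrajToR6.py):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def skew(v):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def pose_to_se3(x, y, z, qw, qx, qy, qz):
    """Return [rho, phi] in R^6, where phi = log(R) and rho = V^{-1} t."""
    t = np.array([x, y, z])
    phi = Rotation.from_quat([qx, qy, qz, qw]).as_rotvec()  # SciPy order: (x, y, z, w)
    theta = np.linalg.norm(phi)
    K = skew(phi)
    if theta < 1e-8:                       # small-angle approximation
        V = np.eye(3) + 0.5 * K
    else:
        V = (np.eye(3)
             + (1 - np.cos(theta)) / theta ** 2 * K
             + (theta - np.sin(theta)) / theta ** 3 * (K @ K))
    rho = np.linalg.solve(V, t)
    return np.concatenate([rho, phi])
```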

HTLife avatar May 08 '18 08:05 HTLife

IMU

@HTLife Thanks for the explanation.

VICON

I just found that the Vicon timestamps and the IMU timestamps are not actually recorded on the same clock (the IMU and camera are on the same clock). How did you find the time offset between the Vicon and the IMU?

copark86 avatar May 09 '18 02:05 copark86

@HTLife It seems like they used gravity-removed acceleration. If gravity is not removed from the acceleration, Eq. 10 and Eq. 13 are totally wrong. This is important but is not mentioned anywhere in the paper. What do you think?

copark86 avatar May 11 '18 06:05 copark86

@copark86 The IMU values are not directly connected to the se(3) output; the LSTM still has a chance to output the right value from that input.

However, having values close to 0 might help the convergence speed.
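For context, a rough sketch of removing gravity from a raw accelerometer reading given an orientation estimate (assuming the common convention where a stationary accelerometer reads about +9.81 along world z; frames and signs would need to match the actual dataset):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# At rest the accelerometer reads ~+9.81 m/s^2 along the world "up" axis.
GRAVITY_WORLD = np.array([0.0, 0.0, 9.81])

def remove_gravity(acc_body, q_world_from_body):
    """Subtract the gravity component from a body-frame accelerometer reading.

    acc_body:          (3,) raw accelerometer sample in the body frame
    q_world_from_body: (qw, qx, qy, qz) orientation of the body in the world frame
    """
    qw, qx, qy, qz = q_world_from_body
    R_wb = Rotation.from_quat([qx, qy, qz, qw])     # SciPy order: (x, y, z, w)
    gravity_body = R_wb.inv().apply(GRAVITY_WORLD)  # gravity expressed in the body frame
    return acc_body - gravity_body
```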


Also, I found that the official project page of VINet was updated. We can follow the updates of slambench2; VINet might become available in that project in the future.

HTLife avatar May 11 '18 07:05 HTLife

I found that I misunderstood how to use the SE3 composition layer. (annotated network diagram) The orange line shows how it should look.

HTLife avatar Jun 18 '18 07:06 HTLife

I think that is correct, although the original figure in the paper shows the current pose being concatenated along with the IMU and FlowNet features. There is no reason to concatenate the current pose. If that is not their intention, the figure is misleading to readers.

Does it converge better now?

Months ago, I wrote new code to regenerate the dataset because I found a possible mistake in your dataset (I will check it again and let you know; it was too long ago).

copark86 avatar Jun 18 '18 21:06 copark86

@copark86 I'm porting the SE3 composition layer from gvnn (Lua + Torch) to PyTorch.
I'll start training again after the porting is complete.

Here is the draft of the gvnn SE3 port => link

HTLife avatar Jun 19 '18 16:06 HTLife

The complete SE3 composition layer implementation is out! (link) I'll start merging this implementation into VINet.
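For anyone curious, the core of such a layer is roughly an se(3) exponential map followed by a matrix multiplication; a minimal PyTorch sketch of that math (an illustration only, not the actual ported gvnn code):

```python
import torch

def so3_hat(phi):
    """Map so(3) vectors (..., 3) to skew-symmetric matrices (..., 3, 3)."""
    zero = torch.zeros_like(phi[..., 0])
    return torch.stack([
        torch.stack([zero, -phi[..., 2], phi[..., 1]], dim=-1),
        torch.stack([phi[..., 2], zero, -phi[..., 0]], dim=-1),
        torch.stack([-phi[..., 1], phi[..., 0], zero], dim=-1),
    ], dim=-2)

def se3_exp(xi):
    """Exponential map from an se(3) vector (rho, phi) in R^6 to a 4x4 SE(3) matrix."""
    rho, phi = xi[..., :3], xi[..., 3:]
    theta = phi.norm(dim=-1, keepdim=True).clamp(min=1e-8).unsqueeze(-1)  # (..., 1, 1)
    K = so3_hat(phi)
    I = torch.eye(3, dtype=xi.dtype, device=xi.device).expand_as(K)
    R = I + torch.sin(theta) / theta * K + (1 - torch.cos(theta)) / theta ** 2 * (K @ K)
    V = I + (1 - torch.cos(theta)) / theta ** 2 * K + (theta - torch.sin(theta)) / theta ** 3 * (K @ K)
    T = torch.zeros(*xi.shape[:-1], 4, 4, dtype=xi.dtype, device=xi.device)
    T[..., :3, :3] = R
    T[..., :3, 3] = (V @ rho.unsqueeze(-1)).squeeze(-1)
    T[..., 3, 3] = 1.0
    return T

def compose(T_prev, xi_pred):
    """Accumulate a predicted se(3) increment onto the previous SE(3) pose."""
    return T_prev @ se3_exp(xi_pred)
```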

HTLife avatar Jun 27 '18 13:06 HTLife

@HTLife @copark86 Did you manage to get the network to converge with good results?

JesperChristensen89 avatar Mar 04 '19 08:03 JesperChristensen89

@JesperChristensen89 I haven't had time to focus on this project recently. But if you are interested in continuing this work, I'll be glad to join the discussion.

HTLife avatar Mar 09 '19 00:03 HTLife

@Adamquhao Cool, would you be willing to share it with me? (by sending a pull request)

HTLife avatar Jun 13 '19 10:06 HTLife

Regarding the IMU and VICON sampling described above: should we resample the IMU so that it is aligned with the images?

fangxu622 avatar Nov 27 '19 03:11 fangxu622

Hello! I would like to try VINet, but I'm stuck because I don't have the dataset. Also, I have never used Docker before. How can I get the EuRoC MAV dataset, or another dataset like the KITTI VO dataset, suited to this code? Thank you! My email is [email protected].

xuqiwe avatar Oct 16 '20 03:10 xuqiwe

@xuqiwe I didn't finish this work in the end, but @Adamquhao seems to have successfully built and trained the network.

The following is his instruction:

"First, I recommend you read "Selective Sensor Fusion for Neural Visual-Inertial Odometry", whose authors are from the same department as VINet's. That paper reveals some details about VINet-like networks: the VO features and the IMU features need to be the same size. You can resize the features after the VO encoder using an fc layer and concatenate them together. Then you feed the concatenated feature directly into the last LSTM (with a suitable sequence length). Finally you get the 6-DoF pose between image pairs through fc layers. The idea is very simple. There are some tricks during pretraining: first pretrain the DeepVO decoder (without the LSTM) on the KITTI odometry dataset and keep that backbone fixed in later experiments (idea from https://github.com/linjian93/pytorch-deepvo). I did not compare the results with those in VINet, but the loss did converge and I got reasonable results (better than DeepVO only)."

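A minimal PyTorch sketch of the fusion idea he describes (my own reading of it; layer sizes and names are illustrative, not his actual code):

```python
import torch
import torch.nn as nn

class SimpleVIOFusion(nn.Module):
    """Resize VO features with an fc layer, concatenate them with IMU features,
    run the result through an LSTM, and regress the 6-DoF relative pose."""

    def __init__(self, vo_dim=1024, imu_dim=128, fused_dim=128, hidden=256):
        super().__init__()
        self.vo_fc = nn.Linear(vo_dim, fused_dim)    # bring VO features to the same size as IMU features
        self.lstm = nn.LSTM(fused_dim + imu_dim, hidden, num_layers=2, batch_first=True)
        self.pose_fc = nn.Linear(hidden, 6)          # 6-DoF pose between image pairs

    def forward(self, vo_feat, imu_feat):
        # vo_feat:  (batch, seq, vo_dim)  from the (frozen) visual encoder
        # imu_feat: (batch, seq, imu_dim) from the IMU RNN
        fused = torch.cat([self.vo_fc(vo_feat), imu_feat], dim=-1)
        out, _ = self.lstm(fused)
        return self.pose_fc(out)
```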
HTLife avatar Oct 18 '20 13:10 HTLife

Thanks a lot!

xuqiwe avatar Oct 19 '20 03:10 xuqiwe