VLocNet
A discussion on performance: share your results
I opened this issue in the hope that people can share their best results and that together we can figure out how to reproduce the VLocNet paper's results.
My result on the heads scene, trained with the VLocNet-M4 setting in this repo: train.sh.txt
Median error: 0.40 m, 36.018°; Mean error: 0.469 m, 39.855°
Paper-reported result for heads: Median error: 0.046 m, 6.64°; Mean error: not reported
However, if I test my model by feeding in the ground-truth value as the previous pose (which is CHEATING), I get: Median error: 0.02 m, 3.25°; Mean error: 0.02 m, 3.84°
I notice that the error of the predicted pose accumulates and propagates at test time if I use the network's predicted pose as the previous pose. Does anyone know why this happens and how to solve it?
I got the same median and mean errors that you got on the heads scene. I wrote the entire model in PyTorch. And yeah, we can't use the ground-truth pose when evaluating it... :/
I noticed increasing errors in my predicted poses as well. I think scale drift is what caused this: as the camera moves, the scale of the environment changes (objects in the scene change size, distances between objects shrink, etc.), and the errors in our predictions accumulate over time. Possible solutions involve combining geometry and deep learning, which is exactly what the paper has done... :/
Maybe I'm glossing over something from the paper.
I also experienced the error-accumulation issue, and as far as I can tell it can't be avoided in VLocNet. I haven't found any module in the model that might help with error correction. I emailed the author of this paper and he told me to have a look at VLocNet++, which he said was far better than VLocNet.
Thank you for your comment! And I appreciate that your Keras code was a nice reference for my own effort at re-implementing VLocNet.
Did the author give you any suggestions on reproducing the VLocNet or VLocNet++ paper results? Do you think it is more likely that their results can be reproduced with VLocNet++ than with VLocNet? After all, VLocNet++ is more complicated to implement and, I suspect, harder to train.
Thank you for your feedback as well. My issue with my implementation is that I can reproduce the VLocNet results on 7-Scenes only up to their VLocNet-M2 model, which lacks the recursive pose structure.
However, the paper claims the biggest improvement comes between VLocNet-M2 and VLocNet-M3, which I'm currently unable to reproduce with my M3 implementation.
You are welcome. That code is actually an early attempt from my master's final project. Please ignore the naive implementation. XD
The author only told me to have a look at the "Adaptive Weight Fusion" layers in VLocNet++. When I asked about the exact implementation of either VLocNet or VLocNet++, he declined due to IP restrictions.
In my view, VLocNet++ is much more challenging to train, but I think it can produce better results than VLocNet.
Ok, understood. Thank you :)
Hello, I have recently been reproducing VLocNet and I met the same problem. May I know whether you found a cause or a solution? Thank you!
Hi, after some further investigation, I think the VLocNet paper's results are hardly reproducible (at least for me), mainly for a few reasons:
- As @wang422003 mentioned earlier, VLocNet does not have a mechanism to prevent the drifting issue.
- The paper claims to be on par with structure-based methods. The VLocNet paper indeed reports very good performance on 7-Scenes (better than DSAC). However, it is a bit suspicious to me that VLocNet also reported Cambridge Landmarks results that are less accurate than DSAC, and VLocNet++ did not report Cambridge results at all. This weakens my faith in the paper.
Finally, I made some interesting observations on VLocNet's odometry network. I ran a controlled experiment comparing VLocNet (both my own implementation and this repo's) with DeepVO on the KITTI dataset, and DeepVO ends up performing better, with less drift in the visual odometry.
The key is that in [lines 205-216](https://github.com/ChiWeiHsiao/DeepVO-pytorch/blob/master/data_helper.py), DeepVO
- first re-expresses each pose relative to the first frame in the sequence,
- then computes the relative pose between consecutive frames.
After adding this step to my VLocNet odometry network training, I could get odometry results on KITTI on par with DeepVO (a sketch of the relative-pose computation is below).
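To illustrate, here is a minimal sketch of that two-step computation (my own code, not the DeepVO or repo implementation), assuming each absolute pose is given as a 4x4 camera-to-world matrix:

```python
import numpy as np

def to_consecutive_relative_poses(abs_poses):
    """abs_poses: list of 4x4 absolute (camera-to-world) pose matrices for one sequence."""
    # Step 1: re-express every pose relative to the first frame of the sequence
    T0_inv = np.linalg.inv(abs_poses[0])
    local = [T0_inv @ T for T in abs_poses]
    # Step 2: relative motion of frame t with respect to frame t-1
    return [np.linalg.inv(local[t - 1]) @ local[t] for t in range(1, len(local))]
```

(With homogeneous matrices the two steps collapse into a single consecutive-frame transform; I keep them separate here only to mirror the two steps listed above.)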
DeepVO result (green: GT, blue: prediction):
My result using my VLocNet odometry network implementation (blue: GT, green: prediction):
Hi,
Thanks for your reply and your detailed explanation. I will try your suggestions. BTW, if possible, could you share your revised VLocNet code? Have a nice weekend!
Kind regards, William
Hi, unfortunately my own implementation is for now intertwined with another project that is still under a non-disclosure agreement. I would suggest building on this repo's code, which is a decent start.
But feel free to drop a message/email if you want to discuss further.
Hi, thanks for your advice, keep in touch!
Hi, I have recently been reproducing VLocNet and I met the same problem. May I know whether you have resolved it? Thank you!
Hi, I am not clear about what "if I test my model by feeding in the ground-truth value as the previous pose (which is CHEATING)" means. Does it refer to the following code (vlocnet.py)?
```python
if (self.recur_pose in ['cat', 'add']):
    recur_features = self.global_previous_pose_fc(pose_p)
    recur_features = F.elu(recur_features)
    recur_features = recur_features.view(-1, 1024, 14, 14)
    if (self.recur_pose == 'cat'):
        out3 = torch.cat([out3, recur_features], dim=1)
    elif (self.recur_pose == 'add'):
        out3 = out3 + recur_features
elif (self.recur_pose == 'adapt_fusion'):
    cur_size = list(out3.size())  # NTxCxHxW
    out3 = out3.view(s[0], s[1], *cur_size[1:])  # also for check
    recur_features = torch.cat([torch.cat([out3[:, 0:1], out3[:, :-1]], dim=1), out3], dim=2)
    cur_size[1] *= 2
    recur_features = recur_features.view(*cur_size)
    recur_features = self.global_adapt_fusion(recur_features)
    out3 = recur_features
elif (self.recur_pose == ''):
    pass
else:
    raise ValueError('Invalide recur_pose option!')
```
I guess you mean that when we test the model, we can't use (self.recur_pose == 'cat') to concatenate the predicted pose with the ground-truth pose, otherwise it's a cheating operation. Actually, I am also confused about using (self.recur_pose == 'cat') during training. Do you have any idea about it?
In the VLocNet paper, section III-A: "we feed the previous pose (groundtruth pose during training and predicted pose during evaluation)".
So the first part of your question is correct. At test time, you are supposed to feed in the previously predicted pose P'(t-1) and fuse it with the current prediction P'(t).
On the second question about (self.recur_pose == 'cat'): I think this corresponds to VLocNet-M3 and VLocNet-M4, which have previous-pose fusion. However, the paper doesn't specify how the previous-pose fusion is implemented, which is why the authors of this repo implemented cat, add, and adapt_fusion.
When training the M3/M4 models, it makes sense to feed in the previous frame's ground-truth pose P(t-1)_gt and fuse it with the current predicted pose P'(t).
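To make that concrete, here is a rough sketch of the two regimes, assuming a hypothetical `model(prev_img, cur_img, prev_pose)` interface that returns the current global pose (this is not this repo's actual API):

```python
import torch

def train_step(model, criterion, optimizer, prev_img, cur_img, prev_pose_gt, cur_pose_gt):
    # Training (M3/M4): fuse the previous frame's ground-truth pose P(t-1)_gt
    pred_pose = model(prev_img, cur_img, prev_pose=prev_pose_gt)
    loss = criterion(pred_pose, cur_pose_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def evaluate_sequence(model, images, init_pose):
    # Evaluation: feed the network's own previous prediction P'(t-1) recursively
    prev_pose, preds = init_pose, []
    for t in range(1, len(images)):
        pred_pose = model(images[t - 1], images[t], prev_pose=prev_pose)
        preds.append(pred_pose)
        prev_pose = pred_pose  # this is where prediction errors can accumulate
    return preds
```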
Hi, thanks for your response. If I need to feed in the previously predicted pose at test time, do you know how to handle the first test frame, which has no previous prediction?
BTW, when I check the recur_pose code quoted above, it looks as if, at training time, it feeds in the current frame's ground-truth pose rather than the previous frame's ground-truth pose. Do you think that affects the results?
Hi @QLuanWilliamed, if you want, we could try to set up a repo to collaborate on reproducing this paper.
Here are my answers:
"Do you know how to handle the first test frame, which has no previous prediction?" --- I think I simply fed in the ground-truth pose of the 0th frame, for simplicity of implementation. Or you could try feeding the 0th frame twice, as both prev_img and cur_img, to get a first previous predicted pose; see the snippet below.
"It feeds in the current frame's ground-truth pose rather than the previous frame's at training?" --- I think it only makes sense to feed in the previous frame's GT pose during training.
Sorry for the late reply, I just came back from vacation. Actually, I worked on this model just to test its real performance, since our team was suspicious of its high accuracy. Most of them were working on odometry at the time, and all of them suffered from error accumulation. LOL. Unfortunately, we failed to reproduce the results reported in the paper; maybe we missed some details, or we were simply not lucky enough to reach the global minimum.
Thanks, I ran into the same performance problem.
Hi, that is a good idea. May I ask how we can start reproducing it?
Drop me an email @ [email protected]. We could set up a quick chat via Zoom or Slack or something
Recently, I also tried to reproduce the results in the original paper. As mentioned in the discussion above, I modified the code that calculates the VO pose (instead of being calculated in the global frame, it is now calculated in the I(t-1) reference frame).
Besides, I also modified the recur_train code. The repo trains on consecutive sequence samples (each training sample consists of three images, and the input previous-frame pose depends on the model's last prediction). Because training has to follow the sequence, the batch size is limited (unless you sample many pseudo-sequences from the original image sequences in the dataset, e.g. split the long seq-01 of the heads scene into many short sequences, but that is a bit complicated).
So instead I sample two consecutive frames and add Gaussian noise to the ground-truth pose of I(t-1); the noised pose then serves as a pseudo predicted pose for the previous frame, and I can use batch_size=16. I now get errors on the heads scene of around 0.14 m, 15°, which still leaves a gap to the results in the original paper.
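Roughly, the perturbation I use looks like the following (my own sketch; the noise scales are values I picked, not numbers from the paper):

```python
import torch

def noisy_previous_pose(prev_pose_gt, trans_std=0.02, rot_std=0.01):
    """Turn the ground-truth pose of I(t-1), laid out as [x, y, z, qw, qx, qy, qz],
    into a pseudo 'predicted' pose by adding Gaussian noise."""
    noisy = prev_pose_gt.clone()
    noisy[..., :3] += trans_std * torch.randn_like(noisy[..., :3])  # translation noise (metres)
    noisy[..., 3:] += rot_std * torch.randn_like(noisy[..., 3:])    # quaternion noise
    noisy[..., 3:] /= noisy[..., 3:].norm(dim=-1, keepdim=True)     # re-normalise the quaternion
    return noisy
```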
Do you have any new results or findings? I would be very glad to join your discussion. If you don't mind, I will contact you by the email posted above.
Hi @ez4lionky,
Thanks for your interest in reproducing this project. You can drop me an email at [email protected], and I will then create a channel on Slack.