monodepth2
Pose on identity transformation
Hey there,
Thanks for your work!
To check whether the method is applicable to my dataset, I've intentionally overfitted the network on one specific data sequence with a moving camera. For most of the sequence, the camera's baseline is small (rotation is close to identity and translation is also small), but occasionally it rotates by quite a large angle.
However, the predicted egomotion is always very close to the identity for every frame.
These two images are an example of a time when camera movement is low:
These two images are an example of a time when camera rotation is large:
Is it even possible to train the network on this sort of data?
It may well be. Might be worth giving it a go!
But for the two sets of images:

- The first pair of images probably has too small a camera motion. Might be worth subsampling frames a little here.
- The second pair of images has quite a large camera motion, but it's probably ok as long as not too many sequences are like this.
It might be worth looking at the KITTI sequences (https://www.youtube.com/watch?v=KXpZ6B1YB_k) to get a sense of what types of camera motion monodepth2 works well with.
Thanks for your response!
I've increased the baseline to simulate camera movement like in KITTI, sampling frame ids -5, 0, 5.
Before starting training, I've tried to overfit on a small subset of data, by taking only 5 triplets of images from the same sequence with quite some egomotion. I've trained from scratch (except encoders) and got the following disparity maps
The outlines seem very accurate, but the depth is far from the ground truth. While the real depth ranges from 0 to 80 meters, the depth obtained from the disparity map above with `disp_to_depth` is very flat, almost constant over the whole image, ranging from 1.9 to 1.92.
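(For reference, `disp_to_depth` roughly does the following – I'm assuming the default `min_depth=0.1` and `max_depth=100` here:)

```python
# Roughly what layers.disp_to_depth does (assuming default min_depth=0.1, max_depth=100):
def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Convert the network's sigmoid output to depth via a scaled inverse."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth

# A sigmoid output stuck around ~0.05 maps to depth = 1 / (0.01 + 9.99 * 0.05) = ~1.96,
# which matches the flat 1.9-1.92 values I'm seeing.
```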
Have you seen similar behavior before, or do you have any hints on how to make the training more stable?
I haven't seen things quite like this before – it looks quite strange.
Have you adjusted the intrinsics in the dataloader to reflect your dataset's intrinsics? This is important; otherwise the training isn't going to work.
Separately:
Are you able to render stereo pairs from this dataset? If so it is much easier to debug intrinsics, dataloading and training with a stereo dataset. Then once that works you can switch to the harder task of mono training.
Yes, sure, I've set it up in my dataloader:

```python
self.K = np.array([[1158, 0, 960, 0],
                   [0, 1158, 540, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
```
What do you mean by stereo pairs? I don't really have any stereo pairs in my dataset; it's monocular.
Ah – it looks like you might be using unnormalised intrinsics. Take a look at the KITTI intrinsics we use and the comment above:
```python
# NOTE: Make sure your intrinsics matrix is *normalized* by the original image size.
# To normalize you need to scale the first row by 1 / image_width and the second row
# by 1 / image_height. Monodepth2 assumes a principal point to be exactly centered.
# If your principal point is far from the center you might need to disable the horizontal
# flip augmentation.
self.K = np.array([[0.58, 0, 0.5, 0],
                   [0, 1.92, 0.5, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
```
We would expect the numbers in your normalised `K` to be around 1.0, rather than around 1000.
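In other words, with the numbers you posted (1158-pixel focal length, 1920x1080 images), a normalised matrix would look something like this sketch:

```python
import numpy as np

# Sketch: normalising pixel-space intrinsics by the *original* image size.
# fx, fy, cx, cy and the image size are the values posted above.
width, height = 1920, 1080
fx, fy, cx, cy = 1158.0, 1158.0, 960.0, 540.0

K = np.array([[fx / width, 0,           cx / width,  0],   # fx -> ~0.603, cx -> 0.5
              [0,          fy / height, cy / height, 0],   # fy -> ~1.072, cy -> 0.5
              [0,          0,           1,           0],
              [0,          0,           0,           1]], dtype=np.float32)
```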
On stereo pairs: that's ok – if you don't have stereo pairs then that's fine. (If you did have them, they can be really useful for debugging.)
Thanks for the hint. I tried that, but even after scaling the intrinsics properly I get almost the same predictions and training metrics. The loss itself is low and doesn't decrease (you can ignore `depth_rmse`).
I've also tried training on only three triplets from the KITTI dataset, with frame ids 0, -1, 1, to check whether I can overfit on them.
The results look somewhat similar to what I got with my own dataset (you can ignore `depth_rmse`; this is not disparity, but depth)
Hi, could you try using a KITTI pretrained model on your GTA V data? Usually, if the intrinsics are good, the model should be able to fine-tune on the new data (as long as the relative poses aren't too difficult to learn).
Yes, tried it. More or less the same obscure depth maps. I've also tried visualizing `prediction` and `target` in `compute_loss` after training for 20 epochs (warped images from the -10 and 10 frames vs the target image):
My current problem seems very similar to this issue, but there's no answer there. I've also seen quite a few comments of a similar nature: the loss sits around 0.13 and just doesn't converge.
What are the intrinsics after normalization?
Originally, for images of size 1080x1920 the intrinsics are:
```python
self.K = np.array([
    [1158, 0, 960, 0],
    [0, 1158, 540, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1]], dtype=np.float32)
```
I'm resizing the images for training to 576x960, and scale intrinsics accordingly:
```
[[579.     0.    480.     0. ]
 [  0.   617.6   288.     0. ]
 [  0.     0.      1.     0. ]
 [  0.     0.      0.     1. ]]
```
Finally, after normalizing (this one is used throughout training):
```
[[0.603  0.     0.5    0. ]
 [0.     1.072  0.5    0. ]
 [0.     0.     1.     0. ]
 [0.     0.     0.     1. ]]
```
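(As a sanity check on these numbers – normalising the resized intrinsics by the resized image size gives the same matrix as normalising the originals by 1920x1080:)

```python
import numpy as np

# fx, fy, cx, cy at the original and resized resolutions, taken from above
orig = np.array([1158.0, 1158.0, 960.0, 540.0]) / [1920, 1080, 1920, 1080]
resized = np.array([579.0, 617.6, 480.0, 288.0]) / [960, 576, 960, 576]
print(orig)     # [0.603 1.072 0.5   0.5  ]
print(resized)  # [0.603 1.072 0.5   0.5  ]
```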
Thanks for these – these intrinsics look like they are in a reasonable range, I think. (BTW where did you get the focal lengths for this GTA data from?)
Do the depths from the pretrained KITTI model look reasonable? If so, you can use the KITTI model to check if everything is set up correctly without needing to retrain:
- Start training with the KITTI weights loaded
- Verify the depth predictions look like some form of reasonable scene shape (`disp_0` tab in tensorboard)
- Check the reprojection image (`color_pred_*` tab) – this should look like a reasonable reconstructed image (not the brown stripy-line images you shared above)
- You don't need to run any long training to check this – just the very first plot is enough, after about a minute or so of running the training code.
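Alternatively, for a completely standalone check of the pretrained KITTI model on one of your frames (no training at all), something along the lines of the repo's test_simple.py should do – the image path and model folder below are placeholders:

```python
import torch
from torchvision import transforms
import PIL.Image as pil

import networks  # from the monodepth2 repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "models/mono_640x192"      # placeholder: wherever the KITTI weights live

# Load the depth encoder/decoder weights
encoder = networks.ResnetEncoder(18, False)
enc_dict = torch.load("{}/encoder.pth".format(model_path), map_location=device)
feed_height, feed_width = enc_dict["height"], enc_dict["width"]
encoder.load_state_dict({k: v for k, v in enc_dict.items() if k in encoder.state_dict()})
encoder.to(device).eval()

decoder = networks.DepthDecoder(num_ch_enc=encoder.num_ch_enc, scales=range(4))
decoder.load_state_dict(torch.load("{}/depth.pth".format(model_path), map_location=device))
decoder.to(device).eval()

# Run on a single GTA frame (placeholder path) and inspect the predicted disparity
img = pil.open("gta_frame.png").convert("RGB").resize((feed_width, feed_height))
with torch.no_grad():
    inp = transforms.ToTensor()(img).unsqueeze(0).to(device)
    disp = decoder(encoder(inp))[("disp", 0)]   # sigmoid disparity at the finest scale
```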
If it looks bad, some things to check:
- The intrinsics (again!). I know you have done this twice but this is honestly where 40% of mistakes happen. Check where the focal lengths come from, and that you trust that source of information.
- Check you haven't accidentally made any other modifications to the codebase other than in the dataloader.
- Check how spaced apart the image sequences are (open a `color_0_0` image from tensorboard in a new tab, and also a `color_*_0` image in a separate tab. Flick between them. Does the distance the camera travels look similar to the distance travelled when you do the same for the KITTI dataset?)
- I would disable flip augmentation for this debugging step
- Check that the network is definitely using the output of the pose network, and isn't using the extrinsics provided by the dataloader (these blocks of code should never be run: this and this).
- Check the run command is basically what we provide in the README for monocular training.
Overall, for this sort of debugging, tensorboard is your friend. Check you can get reasonable reprojected images. And you can do this without retraining.
Let us know how you get on!
Thanks for such an elaborate response! It definitely helped me a lot!
- Intrinsics are taken from here. They're 100% correct – I've constructed point clouds with them and they look really nice.
- I just wiped out the whole repo and started from "scratch" again. The loss finally started to go down. I disabled flip, kept the KITTI image size, considered both the KITTI and GTA datasets separately, and checked the following options:
  - Overfit from the checkpoint: the loss goes down, the depth looks reasonable, and the edges become sharper as the training progresses.
  - Overfit from scratch with weights initialized via `--weights_init 'scratch'`: the loss goes down and stagnates around 0.07, but the depth estimation doesn't make any sense – it's pretty much dead. The reprojected image is close to the target. Did that for 100 epochs just to be sure.
  - Overfit from scratch with weights initialized via `--weights_init 'pretrained'`: exactly the same behavior as with `scratch`. Maybe this happens because the problem itself has infinitely many solutions.
- I actually have ground truth depth maps in meters, and while the predicted depth visually looks fine even after overfitting, it is not on the same scale. For example, the predicted value is 2 while the real one is 11. In the repo you only scale the depth for the stereo setup, where the transformation between cameras is known. Is there any way to do this in a monocular setup, given that I have the gt depth maps?
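(For what it's worth, as far as I can tell the mono evaluation in this repo handles the unknown scale by per-image median scaling against the ground truth – a minimal sketch of that idea, with the helper name being mine:)

```python
import numpy as np

# Minimal sketch (helper name is mine): align a mono depth prediction to metric
# scale using the ratio of medians over valid ground-truth pixels.
def median_scale(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
    mask = (gt_depth > min_depth) & (gt_depth < max_depth)
    ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    return pred_depth * ratio, ratio
```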
Ok great – so it works when starting from KITTI, but it's the scratch training which is still the problem?
Might be worth trying with the `--v1_multiscale` option, to see how that helps. Not quite sure why it would help, but worth trying.
Do you have any more images from your dataset to share? I wonder if the types of textures in them might just be especially problematic.
It works starting from KITTI – visually it looks good, but quantitatively not really for GTA.
Yes, it would be great to have the ability to train it from scratch.
Well, I can't overfit even on a KITTI tracking left-camera triplet, so maybe the texture is not the problem. I've just taken `000000.png`, `000001.png`, `000002.png` from the `0000` sequence. The loss goes down but the disparity doesn't look good. The only thing I've changed was `get_image_path` in `KITTIRAWDataset`:
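(The exact edit isn't reproduced here; a sketch of such an override for the KITTI tracking folder layout – the class name and path layout below are assumptions:)

```python
import os
from datasets import KITTIRAWDataset  # monodepth2's dataset class

class KITTITrackingDataset(KITTIRAWDataset):
    """Assumed sketch: point get_image_path at the KITTI tracking layout
    (image_0{2,3}/<sequence>/<frame:06d>.png) instead of the raw-data layout."""

    def get_image_path(self, folder, frame_index, side):
        f_str = "{:06d}{}".format(frame_index, self.img_ext)
        return os.path.join(
            self.data_path, "image_0{}".format(self.side_map[side]), folder, f_str)
```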
Yes, I would expect overfitting to small sections of KITTI to be a problem – this type of self-supervised training really benefits from seeing lots of different images, to help the network get out of local minima.
How many GTA images are you training from?
Ok, I see, maybe it's the same for GTA. On GTA I also tried only 3 triplets, 9 images in total. But in general, I have around a million.
What do you think might be a reasonable number of triplets to overfit on?
Ah ha! This could be the problem, I'd hope.
I'd take a similar number to what's in the KITTI dataset (I forget exactly how many)
But maybe at least 10k triplets?
Please do let us know if you have any progress – we'd love to see some GTA-trained depth maps!
@mdfirman yes, of course, I'm currently training, will post as soon as ready!
@mdfirman I've finally finished my experiments. Here are some outputs in case someone is interested.
TLDR:
- The model fails to overfit on a smaller data sample and needs a certain number of triplets (8k in my case)
- Silhouettes on the depth maps look fine, but the scale is off
- Best option is to start from a checkpoint
- The training itself is not really stable
- Egomotion quality is not tested visually yet
Some details on the dataset. It is not really a generic GTA dataset, but rather a GTA dataset for specific scenarios, where we have a lot of sequences and really crowded scenes viewed from a pedestrian's viewpoint. This dataset should be published soon and is currently in pre-print.
The dataset contains both dynamic and static camera sequences and has a really small baseline. I've sampled only the dynamic sequences and took every 5th image to form the triplets, to simulate the movement in KITTI. I've also increased the resolution to 540x960. For the first shot, I've selected only 8k images for training and 1k images for validation. The dataset also includes ground-truth depth and egomotion, so I was able to compute losses for them in the validation step.
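(For reference, the subsampling amounts to generating split files with a stride of 5 – a rough sketch, with the path and sequence name purely illustrative:)

```python
# Rough sketch (path and sequence name are illustrative): build monodepth2-style
# split lines ("folder frame_index side") with a stride of 5, so that the
# -5/0/+5 neighbours exist for every listed frame.
def make_split_lines(folder, num_frames, stride=5, side="l"):
    return ["{} {} {}".format(folder, i, side)
            for i in range(stride, num_frames - stride, stride)]

with open("splits/gta/train_files.txt", "w") as f:  # illustrative path
    f.write("\n".join(make_split_lines("sequence_0000", num_frames=1000)))
```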
I tried the following things:
| Method | Depth RMSE | Egomotion L1 |
| --- | --- | --- |
| 8k_hr_pretrained (20 epochs) | 24.49 | 0.02 |
| 8k_hr_pretrained_5 (20 epochs) | 20.74 | 0.13 |
| 8k_hr (30 epochs, init with imagenet) | 24.87 | 0.02 |
| 8k_hr_5 (30 epochs, init with imagenet) | 22.62 | 0.11 |
| 8k_hr_5 (30 epochs, scratch) | 15.6 (Dead) | 0.11 (Dead) |
Here `8k` is the number of images, `hr` stands for high resolution, and `_5` is the window in which we take the images (in this case -5, 0, 5). Obviously, the smaller the window, the smaller the egomotion error.
Although training from scratch without ImageNet initialization gives the smallest error, the depth was completely flat and close to 0 everywhere, even though the reconstructed images looked fine. The option with the pre-trained model worked best and was selected for training on a larger dataset.
I trained the model with checkpoint initialization and -5, 0, 5 frame ids, on 50k training images and 12k validation images, for 50 epochs. The depth silhouettes look fine, but the scale is very different: for instance, where the real depth equals 17, the prediction is around 3. Moreover, we have the infinite depth issue, but this was already discussed in the issues here, and I'm thinking of moving here for this particular issue, to utilize the segmentation masks we also have.
Regarding the losses, the training doesn't look stable. I'm interested in whether you also had a similar loss/metric behavior:
Depth metrics:
Egomotion metrics:
Thanks for reporting back!