monodepth2
Pose on identity transformation
Hey there,
Thanks for your work!
To check whether the method is applicable to my dataset, I've intentionally overfitted the network on one specific data sequence with a moving camera. For most of the sequence, the camera's baseline is small (rotation is close to identity and translation is also small), but occasionally it rotates by quite a large angle.
However, the predicted egomotion is always very close to the identity for every frame.
These two images are an example of a time when camera movement is low:
These two images are an example of a time when camera rotation is large:
Is it even possible to train the network on this sort of data?
It may well be. Might be worth giving it a go!
But for the two sets of images:

- The first pair of images probably has too small a camera motion. Might be worth subsampling frames a little here.
- The second pair of images has quite a large camera motion, but it's probably ok as long as not too many sequences are like this.
It might be worth looking at the KITTI sequences (https://www.youtube.com/watch?v=KXpZ6B1YB_k) to get a sense of what types of camera motion monodepth2 works well with.
Thanks for your response!
I've increased the baseline to simulate camera movement like in KITTI, sampling frame ids -5, 0, 5.
Before starting training, I've tried to overfit on a small subset of data, by taking only 5 triplets of images from the same sequence with quite some egomotion. I've trained from scratch (except encoders) and got the following disparity maps
The outlines seem very accurate, but the depth is far from the ground truth. While the real depth ranges from 0 to 80 meters, the depth obtained from the disparity map above with `disp_to_depth` is very flat, almost constant over the whole image, ranging from 1.9 to 1.92.
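(For reference, `disp_to_depth` roughly does the following – I'm assuming the default `min_depth=0.1` and `max_depth=100` here:)

```python
# Roughly what layers.disp_to_depth does (assuming default min_depth=0.1, max_depth=100):
def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Convert the network's sigmoid output to depth via a scaled inverse."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth

# A sigmoid output stuck around ~0.05 maps to depth = 1 / (0.01 + 9.99 * 0.05) = ~1.96,
# which matches the flat 1.9-1.92 values I'm seeing.
```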
Have you seen similar behavior before, or do you have any hints on how to make the training more stable?
I haven't seen things quite like this before – it looks quite strange.
Have you adjusted the intrinsics in the dataloader to reflect your dataset's intrinsics? This is important; otherwise the training isn't going to work.
Separately:
Are you able to render stereo pairs from this dataset? If so it is much easier to debug intrinsics, dataloading and training with a stereo dataset. Then once that works you can switch to the harder task of mono training.
Yes, sure, I've set it up in my dataloader:

```python
self.K = np.array([[1158, 0, 960, 0],
                   [0, 1158, 540, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
```
What do you mean by stereo pairs? I don't really have any stereo pairs in my dataset; it's monocular.
Ah – it looks like you might be using unnormalised intrinsics. Take a look at the KITTI intrinsics we use and the comment above:
```python
# NOTE: Make sure your intrinsics matrix is *normalized* by the original image size.
# To normalize you need to scale the first row by 1 / image_width and the second row
# by 1 / image_height. Monodepth2 assumes a principal point to be exactly centered.
# If your principal point is far from the center you might need to disable the horizontal
# flip augmentation.
self.K = np.array([[0.58, 0, 0.5, 0],
                   [0, 1.92, 0.5, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
```
We would expect the numbers in your normalised `K` to be around 1.0, rather than around 1000.
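In other words, with the numbers you posted (1158-pixel focal length, 1920x1080 images), a normalised matrix would look something like this sketch:

```python
import numpy as np

# Sketch: normalising pixel-space intrinsics by the *original* image size.
# fx, fy, cx, cy and the image size are the values posted above.
width, height = 1920, 1080
fx, fy, cx, cy = 1158.0, 1158.0, 960.0, 540.0

K = np.array([[fx / width, 0,           cx / width,  0],   # fx -> ~0.603, cx -> 0.5
              [0,          fy / height, cy / height, 0],   # fy -> ~1.072, cy -> 0.5
              [0,          0,           1,           0],
              [0,          0,           0,           1]], dtype=np.float32)
```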
On stereo pairs: that's ok – if you don't have stereo pairs then that's fine. (If you did have them, they can be really useful for debugging.)
Thanks for the hint. I tried that, but even after scaling the intrinsics properly I get almost the same predictions and training metrics. The loss itself is low and doesn't decrease (you can ignore `depth_rmse`).
I've also tried training on only three triplets from the KITTI dataset, with frame ids 0, -1, 1, to check whether I can overfit on them.
The results look somewhat similar to what I got with my own dataset (you can ignore `depth_rmse`; this is not disparity, but depth)
Hi, could you try using a KITTI pretrained model on your GTA V data? Usually, if the intrinsics are good, the model should be able to fine-tune on the new data (as long as the relative poses aren't too difficult to learn).
Yes, tried it. More or less the same obscure depth maps. I've also tried visualizing `prediction` and `target` in `compute_loss` after training for 20 epochs (warped images from the -10 and 10 frames vs the target image):
My current problem seems very similar to this issue, but there's no answer there. I've also seen quite a few comments of a similar nature: the loss sits around 0.13 and just doesn't converge.
What are the intrinsics after normalization?
Originally, for images of size 1080x1920 the intrinsics are:
```python
self.K = np.array([
    [1158, 0, 960, 0],
    [0, 1158, 540, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1]], dtype=np.float32)
```
I'm resizing the images for training to 576x960, and scale intrinsics accordingly:
```
[[579.     0.    480.     0. ]
 [  0.   617.6   288.     0. ]
 [  0.     0.      1.     0. ]
 [  0.     0.      0.     1. ]]
```
Finally, after normalizing (this one is used throughout training):
```
[[0.603  0.     0.5    0. ]
 [0.     1.072  0.5    0. ]
 [0.     0.     1.     0. ]
 [0.     0.     0.     1. ]]
```
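(As a sanity check on these numbers – normalising the resized intrinsics by the resized image size gives the same matrix as normalising the originals by 1920x1080:)

```python
import numpy as np

# fx, fy, cx, cy at the original and resized resolutions, taken from above
orig = np.array([1158.0, 1158.0, 960.0, 540.0]) / [1920, 1080, 1920, 1080]
resized = np.array([579.0, 617.6, 480.0, 288.0]) / [960, 576, 960, 576]
print(orig)     # [0.603 1.072 0.5   0.5  ]
print(resized)  # [0.603 1.072 0.5   0.5  ]
```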
Thanks for these – these intrinsics look like they are in a reasonable range, I think. (BTW where did you get the focal lengths for this GTA data from?)
Do the depths from the pretrained KITTI model look reasonable? If so, you can use the KITTI model to check if everything is set up correctly without needing to retrain:
- Start training with the KITTI weights loaded
- Verify the depth predictions look like some form of reasonable scene shape (`disp_0` tab in tensorboard)
- Check the reprojection image (`color_pred_*` tab) – this should look like a reasonable reconstructed image (not the brown stripy-line images you shared above)
- You don't need to run any long training to check this – just the very first plot is enough, after about a minute or so of running the training code.
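Alternatively, for a completely standalone check of the pretrained KITTI model on one of your frames (no training at all), something along the lines of the repo's test_simple.py should do – the image path and model folder below are placeholders:

```python
import torch
from torchvision import transforms
import PIL.Image as pil

import networks  # from the monodepth2 repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "models/mono_640x192"      # placeholder: wherever the KITTI weights live

# Load the depth encoder/decoder weights
encoder = networks.ResnetEncoder(18, False)
enc_dict = torch.load("{}/encoder.pth".format(model_path), map_location=device)
feed_height, feed_width = enc_dict["height"], enc_dict["width"]
encoder.load_state_dict({k: v for k, v in enc_dict.items() if k in encoder.state_dict()})
encoder.to(device).eval()

decoder = networks.DepthDecoder(num_ch_enc=encoder.num_ch_enc, scales=range(4))
decoder.load_state_dict(torch.load("{}/depth.pth".format(model_path), map_location=device))
decoder.to(device).eval()

# Run on a single GTA frame (placeholder path) and inspect the predicted disparity
img = pil.open("gta_frame.png").convert("RGB").resize((feed_width, feed_height))
with torch.no_grad():
    inp = transforms.ToTensor()(img).unsqueeze(0).to(device)
    disp = decoder(encoder(inp))[("disp", 0)]   # sigmoid disparity at the finest scale
```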
If it looks bad, some things to check:
- The intrinsics (again!). I know you have done this twice but this is honestly where 40% of mistakes happen. Check where the focal lengths come from, and that you trust that source of information.
- Check you haven't accidentally made any other modifications to the codebase other than in the dataloader.
- Check how spaced apart the image sequences are (open a `color_0_0` image from tensorboard in a new tab, and also a `color_*_0` image in a separate tab. Flick between them. Does the distance the camera travels look similar to the distance travelled when you do the same for the KITTI dataset?)
- I would disable flip augmentation for this debugging step
- Check that the network is definitely using the output of the pose network, and isn't using the extrinsics provided by the dataloader (these blocks of code should never be run: this and this).
- Check the run command is basically what we provide in the README for monocular training.
Overall, for this sort of debugging, tensorboard is your friend. Check you can get reasonable reprojected images. And you can do this without retraining.
Let us know how you get on!
Thanks for such an elaborate response! It definitely helped me a lot!
- Intrinsics are taken from here. They're 100% correct – I've constructed point clouds with them and they look really nice.
- I just wiped out the whole repo and started from "scratch" again. The loss finally started to go down. I disabled flip, kept the KITTI image size, considered both the KITTI and GTA datasets separately, and checked the following options:
  - Overfit from the checkpoint: the loss goes down, the depth looks reasonable, and the edges become sharper as the training progresses.
  - Overfit from scratch with weights initialized via `--weights_init 'scratch'`: the loss goes down and stagnates around 0.07, but the depth estimation doesn't make any sense – it's pretty much dead. The reprojected image is close to the target. Did that for 100 epochs just to be sure.
  - Overfit from scratch with weights initialized via `--weights_init 'pretrained'`: exactly the same behavior as with `scratch`. Maybe this happens because the problem itself has infinitely many solutions.
- I actually have ground truth depth maps in meters, and while the predicted depth visually looks fine even after overfitting, it is not on the same scale. For example, the predicted value is 2 while the real one is 11. In the repo you only scale the depth for the stereo setup, where the transformation between cameras is known. Is there any way to do this in a monocular setup, given that I have the gt depth maps?
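(For what it's worth, as far as I can tell the mono evaluation in this repo handles the unknown scale by per-image median scaling against the ground truth – a minimal sketch of that idea, with the helper name being mine:)

```python
import numpy as np

# Minimal sketch (helper name is mine): align a mono depth prediction to metric
# scale using the ratio of medians over valid ground-truth pixels.
def median_scale(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
    mask = (gt_depth > min_depth) & (gt_depth < max_depth)
    ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    return pred_depth * ratio, ratio
```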
Ok great – so it works when starting from KITTI, but it's the scratch training which is still the problem?
Might be worth trying with the `--v1_multiscale` option, to see how that helps. Not quite sure why it would help, but worth trying.
Do you have any more images from your dataset to share? I wonder if the types of textures in them might just be especially problematic.
It works starting from KITTI – visually it looks good, but quantitatively not really for GTA.
Yes, it would be great to have the ability to train it from scratch.
Well, I can't overfit even on a KITTI tracking left-camera triplet, so maybe the texture is not the problem. I've just taken `000000.png`, `000001.png`, `000002.png` from the `0000` sequence. The loss goes down but the disparity doesn't look good. The only thing I've changed was `get_image_path` in `KITTIRAWDataset`:
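(The exact edit isn't reproduced here; a sketch of such an override for the KITTI tracking folder layout – the class name and path layout below are assumptions:)

```python
import os
from datasets import KITTIRAWDataset  # monodepth2's dataset class

class KITTITrackingDataset(KITTIRAWDataset):
    """Assumed sketch: point get_image_path at the KITTI tracking layout
    (image_0{2,3}/<sequence>/<frame:06d>.png) instead of the raw-data layout."""

    def get_image_path(self, folder, frame_index, side):
        f_str = "{:06d}{}".format(frame_index, self.img_ext)
        return os.path.join(
            self.data_path, "image_0{}".format(self.side_map[side]), folder, f_str)
```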
Yes, I would expect overfitting to small sections of KITTI to be a problem – this type of self-supervised training really benefits from seeing lots of different images, to help the network get out of local minima.
How many GTA images are you training from?
Ok, I see, maybe it's the same for GTA. On GTA I also tried only 3 triplets, 9 images in total. But in general, I have around a million.
What do you think might be a reasonable number of triplets to overfit on?
Ah ha! This could be the problem, I'd hope.
I'd take a similar number to what's in the KITTI dataset (I forget exactly how many)
But maybe at least 10k triplets?
Please do let us know if you have any progress – we'd love to see some GTA-trained depth maps!
@mdfirman yes, of course, I'm currently training, will post as soon as ready!
@mdfirman I've finally finished my experiments. Here are some outputs in case someone is interested.
TLDR:
- The model fails to overfit on a smaller data sample and needs a certain number of triplets (8k in my case)
- Silhouettes on the depth maps look fine, but the scale is off
- Best option is to start from a checkpoint
- The training itself is not really stable
- Egomotion quality is not tested visually yet
Some details on the dataset. It is not really a generic GTA dataset, but rather a GTA dataset for specific scenarios, where we have a lot of sequences and really crowded scenes viewed from a pedestrian's viewpoint. This dataset should be published soon and is currently in pre-print.
The dataset contains both dynamic and static camera sequences and has a really small baseline. I've sampled only the dynamic sequences and took every 5th image to form the triplets, to simulate the movement in KITTI. I've also increased the resolution to 540x960. For the first shot, I've selected only 8k images for training and 1k images for validation. The dataset also includes ground-truth depth and egomotion, so I was able to compute losses for them in the validation step.
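(For reference, the subsampling amounts to generating split files with a stride of 5 – a rough sketch, with the path and sequence name purely illustrative:)

```python
# Rough sketch (path and sequence name are illustrative): build monodepth2-style
# split lines ("folder frame_index side") with a stride of 5, so that the
# -5/0/+5 neighbours exist for every listed frame.
def make_split_lines(folder, num_frames, stride=5, side="l"):
    return ["{} {} {}".format(folder, i, side)
            for i in range(stride, num_frames - stride, stride)]

with open("splits/gta/train_files.txt", "w") as f:  # illustrative path
    f.write("\n".join(make_split_lines("sequence_0000", num_frames=1000)))
```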
I tried the following things:
| Method | Depth RMSE | Egomotion L1 |
| --- | --- | --- |
| 8k_hr_pretrained (20 epochs) | 24.49 | 0.02 |
| 8k_hr_pretrained_5 (20 epochs) | 20.74 | 0.13 |
| 8k_hr (30 epochs, init with imagenet) | 24.87 | 0.02 |
| 8k_hr_5 (30 epochs, init with imagenet) | 22.62 | 0.11 |
| 8k_hr_5 (30 epochs, scratch) | 15.6 (Dead) | 0.11 (Dead) |
Here `8k` is the number of images, `hr` stands for high resolution, and `_5` is the window in which we take the images (in this case -5, 0, 5). Obviously, the smaller the window, the smaller the egomotion error.
Although training from scratch without ImageNet initialization gives the smallest error, the depth was completely flat and close to 0 everywhere, even though the reconstructed images looked fine. The option with the pre-trained model worked best and was selected for training on a larger dataset.
I trained the model with checkpoint initialization and -5, 0, 5 frame ids, on 50k training images and 12k validation images, for 50 epochs. The depth silhouettes look fine, but the scale is very different: for instance, where the real depth equals 17, the prediction is around 3. Moreover, we have the infinite depth issue, but this was already discussed in the issues here, and I'm thinking of moving here for this particular issue, to utilize the segmentation masks we also have.
Regarding the losses, the training doesn't look stable. I'm interested in whether you also had a similar loss/metric behavior:
Depth metrics:
Egomotion metrics:
Thanks for reporting back!