
The steps to obtain absolute depth on a custom dataset

07hyx06 opened this issue 2 years ago • 8 comments

Hi! I looked through some discussions in the MiDaS repo's issues and summarized the steps to obtain absolute depth from the estimated dense inverse depth. Am I right?

Step 0: Run SfM to get some sparse 3D points with correct absolute depth, e.g. (x1,y1,d1), ..., (xn,yn,dn)
Step 1: Invert the 3rd dimension to get 3D points with correct inverse depth, e.g. (x1,y1,1/d1), ..., (xn,yn,1/dn)
Step 2: Run the DPT model to estimate the dense inverse-depth map D
Step 3: Compute a scale S and shift T to align D with {(x1,y1,1/d1), ..., (xn,yn,1/dn)}
Step 4: Output 1/(S×D + T) as the depth
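
For concreteness, here is a minimal sketch of steps 1–4 in NumPy, assuming the sparse SfM points (x_i, y_i, d_i) and the DPT inverse-depth map D are already available (function and variable names are illustrative, not from the DPT codebase):

```python
import numpy as np

def inverse_depth_to_metric(D, xs, ys, ds):
    """Align the dense DPT inverse-depth map D to sparse SfM depths (steps 1-4).

    xs, ys : pixel coordinates of the sparse SfM points (x = column, y = row)
    ds     : their depths from SfM (still subject to the SfM scale ambiguity)
    """
    inv_sparse = 1.0 / np.asarray(ds, dtype=np.float64)      # step 1: invert the sparse depths
    D_i = D[np.asarray(ys), np.asarray(xs)]                  # DPT inverse depth at those pixels
    A = np.stack([D_i, np.ones_like(D_i)], axis=1)           # step 3: [D_i, 1] design matrix
    (S, T), *_ = np.linalg.lstsq(A, inv_sparse, rcond=None)  # least-squares scale and shift
    aligned = S * D + T                                      # aligned dense inverse depth
    return 1.0 / np.clip(aligned, 1e-8, None)                # step 4: depth = 1 / (S*D + T)
```

A plain least-squares fit like this is sensitive to outliers in the SfM points, so dropping obviously wrong points (or using a robust fit) before solving can help.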

07hyx06 avatar Sep 16 '21 07:09 07hyx06

The steps are correct for aligning the estimates to the SfM reconstruction, but note that SfM cannot recover absolute depth either. So the aligned depth maps will have a consistent scale for the given scene, as opposed to an arbitrary scale per image before the alignment, but there is still a missing global scale before you get absolute metric measurements.

These slides give a good overview of SfM ambiguities: https://slazebni.cs.illinois.edu/spring19/lec17_sfm.pdf. Slide 7 shows the relevant issue.

ranftlr avatar Sep 16 '21 09:09 ranftlr

Got it. Thanks for your kind reply!

Another question: in EVALUATION.md I notice that when evaluating on KITTI, the argument absolute_depth is specified and the prediction is scaled by 256. Is there no further post-processing to compute the scale and shift because the dpt_hybrid_kitti-cb926ef4.pt model is trained (or fine-tuned) specifically on KITTI?

https://github.com/isl-org/DPT/blob/f43ef9e08d70a752195028a51be5e1aff227b913/run_monodepth.py#L165 https://github.com/isl-org/DPT/blob/f43ef9e08d70a752195028a51be5e1aff227b913/run_monodepth.py#L166

https://github.com/isl-org/DPT/blob/f43ef9e08d70a752195028a51be5e1aff227b913/util/io.py#L180 https://github.com/isl-org/DPT/blob/f43ef9e08d70a752195028a51be5e1aff227b913/util/io.py#L181
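
For context, the factor of 256 comes from the KITTI depth convention: ground truth and predictions are stored as 16-bit PNGs holding depth in meters multiplied by 256, with 0 marking invalid pixels. A minimal sketch of that convention (an illustration only, not the repo's util/io.py code):

```python
import cv2
import numpy as np

def write_kitti_depth_png(path, depth_m):
    """Store metric depth (meters) as a 16-bit PNG scaled by 256 (KITTI convention)."""
    depth_png = np.clip(depth_m * 256.0, 0, 65535).astype(np.uint16)
    cv2.imwrite(path, depth_png)  # path should end in .png

def read_kitti_depth_png(path):
    """Read a KITTI-style 16-bit depth PNG back into meters; 0 marks invalid pixels."""
    depth_png = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32)
    return depth_png / 256.0
```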

If I want to evaluate the dpt_large model on KITTI, do I still need to follow steps 0–4 above to convert the inverse-depth map?

07hyx06 avatar Sep 16 '21 09:09 07hyx06

Yes to both questions.

A word of caution about evaluating the large model this way: since it doesn't estimate absolute depth, the numbers are no longer directly comparable to the numbers in Table 3, because the alignment step will "remove" part of the error. The numbers will only be comparable to Table 1 (or Table 11 in the MiDaS paper), where we performed the alignment for all methods to ensure a fair comparison.

ranftlr avatar Sep 16 '21 09:09 ranftlr

@07hyx06

Another question: in EVALUATION.md I notice that when evaluating on KITTI, the argument absolute_depth is specified and the prediction is scaled by 256. Is there no further post-processing to compute the scale and shift ...

Additionally, the invert=True, scale, and shift parameters are used; they depend on the model weights and dataset (or, for real-world use, on the model weights, camera intrinsics, and the unit of measurement for depth). A rough sketch of how they are applied follows the links below:

  • when you use the KITTI weights: https://github.com/isl-org/DPT/blob/f43ef9e08d70a752195028a51be5e1aff227b913/run_monodepth.py#L53-L65

  • or the NYU weights: https://github.com/isl-org/DPT/blob/f43ef9e08d70a752195028a51be5e1aff227b913/run_monodepth.py#L68-L80
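
Roughly speaking, these constants map the network's raw (affine inverse-depth) output to metric depth as depth ≈ 1 / (scale · output + shift). A minimal sketch of that post-processing; the constants below are placeholders for illustration, not the values hard-coded in run_monodepth.py:

```python
import numpy as np

def to_metric_depth(raw_output, scale, shift, invert=True):
    """Map the network's raw output to metric depth.

    scale and shift are tied to the specific checkpoint and dataset (and, for a
    real camera, to its intrinsics and the depth unit); invert converts the
    affine inverse depth back into depth.
    """
    pred = scale * raw_output + shift
    if invert:
        pred = 1.0 / np.clip(pred, 1e-8, None)  # guard against division by zero
    return pred

# Placeholder constants for illustration only -- the actual values for the
# KITTI and NYU checkpoints are set in run_monodepth.py (links above).
depth = to_metric_depth(np.random.rand(352, 1216).astype(np.float32), scale=1e-4, shift=1e-2)
```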

AlexeyAB avatar Sep 17 '21 02:09 AlexeyAB

@ranftlr @AlexeyAB Thanks for your help!

07hyx06 avatar Sep 17 '21 06:09 07hyx06

Hi! I did some experiments on the flower dataset these days. Could you give me some advice on improving the alignment result?

I ran COLMAP with the default configuration to obtain the camera parameters and sparse 3D points (I have already converted them to camera coordinates). The goal is to align the DPT estimates to the SfM scale and get a dense depth map for every image.

Denote the depth map output by DPT as D, with shape [h,w], and the collection of sparse 3D points as {[x_i,y_i,d_i]}. First I extract the values of D at {[x_i,y_i]} to get {[D_i]}. Then I simply compute a scale and shift to align {[D_i]} with {[1/d_i]} using np.linalg.lstsq. The fitting result is shown in the figure below: the blue points are (D_i, scale * D_i + shift) and the orange points are (D_i, 1/d_i).

[figure: lst — scale/shift fitting result]

I use the aligned inverse-depth map, together with the SfM-scaled camera parameters, to warp a source image to the target viewpoint; the results (warped vs. target image) are shown below:

[figure: midas_switch — warped image vs. target image]

It seems that some pixels are misaligned between the warped image and the target image. Is this a reasonable result? Can I do anything to improve the fitting process?
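
For reference, a rough sketch of the view-warping check described above, assuming a shared pinhole intrinsic matrix K and a relative pose (R, t) from COLMAP that maps target-camera points into the source camera frame (names are illustrative):

```python
import cv2
import numpy as np

def warp_source_to_target(src_img, tgt_depth, K, R, t):
    """Inverse-warp src_img into the target view using the target's metric depth map.

    tgt_depth : (h, w) depth of the target view (e.g. the aligned DPT output)
    K         : 3x3 pinhole intrinsics, assumed shared by both views
    R, t      : rotation (3x3) and translation (3,) taking target-camera points
                into the source camera frame
    """
    h, w = tgt_depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam = np.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)  # back-project target pixels
    cam_src = R @ cam + t.reshape(3, 1)                      # move points into the source frame
    proj = K @ cam_src
    proj = proj[:2] / np.clip(proj[2:], 1e-6, None)          # perspective divide
    map_x = proj[0].reshape(h, w).astype(np.float32)
    map_y = proj[1].reshape(h, w).astype(np.float32)
    return cv2.remap(src_img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

Pixels that project outside the source image or land on occluded regions will not warp correctly, so some residual misalignment is expected even with a good fit.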

07hyx06 avatar Sep 18 '21 14:09 07hyx06

As the results of the model are not perfect, a residual error is expected. How much error there is will likely vary per image.

Here are some works that try to address the consistency issue:

https://roxanneluo.github.io/Consistent-Video-Depth-Estimation/ https://robust-cvd.github.io/

These works tackle the case of dynamic objects in the reconstruction. If you expect no independently moving objects in the scene, you can also directly use MVS, which will lead to consistent results out of the box.

ranftlr avatar Sep 20 '21 11:09 ranftlr

It seems that some pixels are misaligned between the warped image and the target image. Is this a reasonable result? Can I do anything to improve the fitting process?

@07hyx06 Have you found a solution to this problem?

tdsuper avatar Oct 21 '21 02:10 tdsuper