Metric3D

Metric depth recovery part

xyclone10 opened this issue 1 year ago · 4 comments

Hello! Thank you for your amazing work! I have read the paper, but I may have missed some information, so I would appreciate your answers to a few questions:

  1. For the backbone, as stated in the paper, you used DINOv2-reg as the encoder and DPT as the decoder. Did you freeze the encoder during training?
  2. When preprocessing the data, did you normalize the dataset to relative depth before feeding it to the model? If so, which part of the pipeline handles the recovery of metric depth, and where in the source code is that implemented?
  3. As stated in the supplementary materials, you used depth bins over the range [0.1 m, 200 m] for the ViT models. Could you elaborate on how the depth bins are processed?

As far as I know, DINOv2 normalizes the input before extracting features. Thus, if the encoder were frozen, the output of the DPT decoder would be in relative depth as well. I am still new to this, so please correct me if I am wrong :)

Thank you!

xyclone10 avatar Aug 27 '24 08:08 xyclone10

Thank you for your interest, I hope the following answers help:

  1. Not frozen. We observe that the loss converges much faster, and to a lower level, when all parameters are trained.
  2. No, but all metric depth maps are normalized according to the camera focal-length ratio and size ratio. Please refer to the class LableScaleCanonical for more details.
  3. Uniform bins in log space, a common practice introduced in DORN (Fu et al., CVPR 2018). Please refer to its definition here.
  4. A normalized feature input does not necessarily mean the predicted depth must also be normalized relative depth.
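To make answers (2) and (3) concrete, here is a minimal sketch of both ideas: rescaling depth labels by the focal-length ratio relative to a canonical camera, and uniform log-space depth bins. The canonical focal length of 1000 px and the bin count of 256 are illustrative assumptions, not the exact values hard-coded in the Metric3D repo.

```python
import numpy as np

CANONICAL_FOCAL = 1000.0  # hypothetical canonical focal length in pixels


def to_canonical_space(depth, focal):
    """Rescale a metric depth map as if it were captured by a camera with
    the canonical focal length (the role played by LableScaleCanonical).
    At inference, dividing the prediction by the same ratio recovers
    metric depth for the real camera."""
    scale = CANONICAL_FOCAL / focal
    return depth * scale


def log_space_bins(d_min=0.1, d_max=200.0, n_bins=256):
    """Uniform discretization in log-depth space, as in DORN:
    bin edges form a geometric progression from d_min to d_max."""
    return np.exp(np.linspace(np.log(d_min), np.log(d_max), n_bins + 1))


bins = log_space_bins()
# Consecutive edges have a constant ratio rather than a constant difference,
# so near depths get fine bins and far depths get coarse ones.
```

For example, a 10 m label seen through a 500 px focal-length camera becomes 20 m in the canonical space, and is mapped back by the inverse ratio after prediction.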

JUGGHM avatar Sep 03 '24 18:09 JUGGHM

@JUGGHM Hi, I'm trying to use your model on outdoor scenes to get metric depth, in order to eventually estimate the scale at a specific location, but the depth values vary too much. Could you help me with this? More details here.

MoAbbasid avatar Sep 23 '24 01:09 MoAbbasid

Hello!

I have some questions regarding the HDNL loss you proposed. I tried training with AbsRel + SiLog + HDNL, but it did not converge at all. Does this loss require any special conditions to converge together with the other losses?

This is the output during training: [image]

After one epoch, it gave this kind of result, and the results stay similar for several more epochs: [image]

I have already confirmed that AbsRel + SiLog alone is not the problem; I simply added HDNL with a weight of 0.5, and it no longer converged.
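For reference, the combination I am training with looks like the following sketch. The `hdnl_fn` argument is a placeholder for the actual HDNL module; only AbsRel and the standard SiLog formulation are spelled out, and the 0.5 weight matches what I described above.

```python
import numpy as np


def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative error."""
    return np.mean(np.abs(pred - gt) / (gt + eps))


def silog(pred, gt, lam=0.5, eps=1e-6):
    """Scale-invariant log loss (Eigen et al. formulation)."""
    d = np.log(pred + eps) - np.log(gt + eps)
    return np.sqrt(np.mean(d ** 2) - lam * np.mean(d) ** 2)


def combined_loss(pred, gt, hdnl_fn, w_hdnl=0.5):
    """AbsRel + SiLog + w_hdnl * HDNL; hdnl_fn is a placeholder here."""
    return abs_rel(pred, gt) + silog(pred, gt) + w_hdnl * hdnl_fn(pred, gt)
```

When prediction equals ground truth, both AbsRel and SiLog are zero, so any divergence in my runs should come from the HDNL term or its weighting.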

Thank you!

xyclone10 avatar Jan 03 '25 08:01 xyclone10

@xyclone10 Have you tried overfitting on a small data split? In our training, the model converges easily when all the losses are combined.

YvanYin avatar Jan 03 '25 08:01 YvanYin