About training supervision for depth values
Hi,
Excellent work, thanks for the detailed paper and prompt model release! I found the paper very insightful, and the results are remarkable across various datasets and tasks.
I have a question regarding the supervision signals used during training, particularly for depth values.
Did the training process use relative (affine-invariant) depth ground truth or absolute metric depth for supervision?
Hi, thanks for your interest in our work, and apologies for the late response 🙏 The training data must have scale-invariant depth (i.e., anchored at 0, with no unknown shift; only the global scale may be unknown) and calibrated camera intrinsics. This is because depth values are unprojected to 3D points for model training, and affine-invariant depth, which carries an unknown shift, cannot be unprojected to a consistent 3D shape. Absolute metric depth (in meters) is not required. I hope this answers your question :)
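To illustrate the point about unprojection, here is a minimal NumPy sketch, assuming a standard pinhole camera model. The helper names (`unproject`, `normalized_shape`) and the intrinsics values are hypothetical and are not taken from the project's code. It checks that an unknown global scale leaves the unprojected shape unchanged after normalization, while an unknown shift distorts it.

```python
import numpy as np


def unproject(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth map to (H, W, 3) camera-space points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def normalized_shape(points):
    """Remove the global scale by dividing by the mean distance to the camera center."""
    pts = points.reshape(-1, 3)
    return pts / np.linalg.norm(pts, axis=1).mean()


# Hypothetical toy data: a small random depth map and made-up intrinsics.
rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 5.0, size=(4, 6))
fx = fy = 5.0
cx, cy = 2.5, 1.5

base = normalized_shape(unproject(depth, fx, fy, cx, cy))
scaled = normalized_shape(unproject(3.0 * depth, fx, fy, cx, cy))    # unknown scale
shifted = normalized_shape(unproject(depth + 2.0, fx, fy, cx, cy))   # unknown shift

scale_error = np.abs(base - scaled).max()   # numerically zero: same shape up to scale
shift_error = np.abs(base - shifted).max()  # clearly non-zero: the geometry is distorted

print(f"scale error: {scale_error:.2e}, shift error: {shift_error:.2e}")
```

Scaling the depth multiplies every 3D point by the same factor, so the geometry is preserved up to scale. Adding a constant moves each point along its own camera ray by a different relative amount, which is why a depth map with an unknown shift does not give a usable 3D target.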