Is there scale ambiguity when training VGGT?
Hi, dear author,
In your paper, you mentioned that the depth/pointmap scale is defined as the average Euclidean distance of all 3D points in the point map to the origin. Does this mean that the scale value might change even for the same scene if the 3D points are slightly altered (e.g., by randomly dropping some points)? If so, would the ground truth also change for the same input?
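For concreteness, here is a minimal sketch of how I understand that definition (my own NumPy code, not taken from your codebase; `pointmap_scale` and the toy point map are just for illustration):

```python
import numpy as np

def pointmap_scale(points: np.ndarray) -> float:
    """Scale as I understand it from the paper: the average Euclidean
    distance of all valid 3D points in the point map to the origin.

    points: (N, 3) array of 3D points, e.g., a flattened point map with
    invalid pixels already removed.
    """
    return float(np.linalg.norm(points, axis=1).mean())

# Toy example: randomly dropping a small fraction of points changes the
# computed scale, but only slightly.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100_000, 3))      # stand-in for a dense point map
keep = rng.random(len(pts)) > 0.01       # randomly drop ~1% of the points
print(pointmap_scale(pts), pointmap_scale(pts[keep]))
```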
Did I understand this correctly, or am I missing something?
Looking forward to your reply. Thank you very much!
Hi, yes, dropping some points may change the scale, but random dropping usually makes only a very small difference, e.g., from 1.7863 to 1.7861.
I am not sure if I understand it correctly, but even in this case, the ground truth is the same for the same input images, right?
Sorry for not making myself clear earlier. I'm not referring to exactly the same input, but rather similar ones.
For example, in the Waymo dataset, when the ego vehicle is stationary, the depth of static background objects should remain consistent across consecutive frames. However, since the scale is defined as the average Euclidean distance of all 3D points to the origin, it might slightly vary between frames — especially if there are dynamic objects present in the scene.
In this case, following your approach, would the computed ground-truth depth for those static objects also vary slightly across frames due to changes in scale? I wonder if this introduces inconsistency in the supervision signal, even for the same static background.
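A small numeric sketch of the situation I have in mind (the depth and scale values below are made up, and I am assuming the ground-truth depth is obtained by dividing metric depth by the computed scale):

```python
# Hypothetical numbers, only to illustrate the concern: the same static
# background point, observed in two consecutive frames while the ego
# vehicle is stationary.
static_depth = 12.5      # metric depth of the static point (same in both frames)

# Per-frame scale = average distance of ALL points to the origin; a moving
# object can shift this slightly from one frame to the next.
scale_frame_0 = 1.7863
scale_frame_1 = 1.7861

# After normalizing each frame by its own scale, the ground-truth depth of
# the same static point differs slightly between the two frames.
gt_frame_0 = static_depth / scale_frame_0
gt_frame_1 = static_depth / scale_frame_1
print(gt_frame_0, gt_frame_1)   # ~6.9977 vs ~6.9985
```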
Looking forward to your clarification!
Best regards,