ViT-Lens
Alternate depth normalization
The justification given in the paper for using disparity is "scale normalization". I know this convention comes from Omnivore and ImageBind. However, it does not actually achieve scale normalization.
What could scale normalization mean? Disparity images are not scale invariant in the way RGB images are: if you bring an object closer, its disparities grow, whereas in an RGB image its colors stay the same. So it must mean something like: two objects with the same "disparity" should take up the same number of pixels in the image.
To achieve this, you should use f/depth instead of b*f/depth. This makes sense because the baseline b is an arbitrary property of your particular camera setup and tells you nothing about the geometry of the scene you are looking at. If you physically change b, the depth image does not change, but the disparity does.
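A minimal sketch of the proposed change (function and variable names are mine, not from the repo), dropping the baseline term so the result depends only on scene geometry and the camera's focal length:

```python
import numpy as np

def disparity_scale_normalized(depth, focal_length):
    """Proposed normalization: f / depth (baseline b dropped).

    Unlike b*f/depth, this does not depend on the stereo baseline,
    which is a property of the camera rig, not of the scene.
    """
    depth = np.asarray(depth, dtype=np.float64)
    # Guard against zero/invalid depth values (common in real depth maps).
    return np.where(depth > 0, focal_length / depth, 0.0)

# An object 2 m away seen with a 600 px focal length:
d = disparity_scale_normalized(np.array([[2.0]]), focal_length=600.0)
# d[0, 0] is 300.0 regardless of what baseline the rig happened to have
```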
One other suggested improvement: when you resize the image to 224, you are implicitly changing the focal length. So if h is the original height of the image, I would suggest computing the "disparity" as
(224/h) * f / depth
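The resize-adjusted version can be sketched like this (again, names are illustrative). The point is that the same physical scene captured at different resolutions, with the focal length scaling accordingly, maps to the same adjusted disparity:

```python
import numpy as np

def disparity_resized(depth, focal_length, orig_height, target_size=224):
    """Adjusted disparity: (target_size / h) * f / depth.

    Resizing an image of height h to target_size scales the effective
    focal length by target_size / h; folding that factor in preserves
    "same disparity => same pixel extent" across source resolutions.
    """
    depth = np.asarray(depth, dtype=np.float64)
    scale = target_size / orig_height
    return np.where(depth > 0, scale * focal_length / depth, 0.0)

# The same 2 m object captured at h=448 with f=1200 px, and at
# h=224 with f=600 px, gets identical adjusted disparity:
a = disparity_resized(np.array([2.0]), 1200.0, orig_height=448)
b = disparity_resized(np.array([2.0]), 600.0, orig_height=224)
```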
If the normalization is having any positive effect, I bet this improved normalization will do better.