Add a table: "Invariance to image size" based on the Spring Benchmark
Many thanks for training and providing models that are better than DepthPro. I am particularly grateful that UniDepthV2 is flexible in terms of input image size, as described in this quote:
Specifically, we sample images with variable pixel counts between 0.2MP and 0.6MP, allowing the model to operate effectively across diverse resolutions without being biased toward a single fixed input size.
I am very glad that you added Fig. 4, "Invariance to image shape". I think it would be a good idea to add something similar for different input image sizes, in the form of an "Invariance to image size" table.
In my opinion, depth estimation is most applicable to video files with a standard resolution of 1920x1080 or 3840x2160 and the typical aspect ratio of most monitors, i.e. 16:9. I would therefore like to ask for a comparison of unidepth-v2-vitl14 on the zero-shot Spring Benchmark (Spring FAQ).
This is a synthetic dataset, so it contains very precise depth data, and it is based on video material with a typical resolution for films: 1920x1080. It is very often used to evaluate depth estimation models, including metric models such as SharpDepth and DepthPro.
The more resolutions the better, especially those above 0.6MP, to see at what resolution the quality starts to drop off. I know that the input sides have to be divisible by 14, so I am giving only approximate sizes for a typical 16:9 format (a small rounding sketch follows the list):
- 1920x1080 (1080p)
- 1842x1036 (2x518)
- 1280x720 (720p)
- 960x540 (1/3 of 2160p)
- 921x518 (518 DINOv2)
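Just to make the request concrete, here is a small sketch that rounds each of the sizes above to the nearest multiple of 14 (the helper name `snap_to_patch` is my own, not part of the UniDepth code):

```python
# Hypothetical helper: snap a target resolution to the nearest multiple of 14,
# since the ViT patch size is 14 and both sides must be divisible by it.
def snap_to_patch(width: int, height: int, patch: int = 14) -> tuple[int, int]:
    snap = lambda x: max(patch, round(x / patch) * patch)
    return snap(width), snap(height)

for w, h in [(1920, 1080), (1842, 1036), (1280, 720), (960, 540), (921, 518)]:
    sw, sh = snap_to_patch(w, h)
    print(f"{w}x{h} -> {sw}x{sh} ({sw * sh / 1e6:.2f} MP)")
```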
The best metric for evaluating metric models is, of course, the F-score.
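To be explicit about what I mean by F-score: the point-cloud variant, i.e. precision and recall of points within a distance threshold, combined as 2PR/(P+R). A rough NumPy sketch under that assumption (my own brute-force version, not taken from any evaluation codebase):

```python
import numpy as np

def f_score(pred_pts: np.ndarray, gt_pts: np.ndarray, tau: float = 0.05) -> float:
    """Point-cloud F-score: harmonic mean of precision and recall at threshold tau.

    pred_pts: (N, 3) predicted 3D points; gt_pts: (M, 3) ground-truth 3D points.
    Brute-force distances, fine for small clouds; use a KD-tree for full-res maps.
    """
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()  # predicted points near some GT point
    recall = (d.min(axis=0) < tau).mean()     # GT points near some predicted point
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```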
I will run experiments on Spring with different, larger input sizes and get back here.
However, really large image sizes lead to memory issues, and 1080p may be too much for practical usage due to the ViT's global attention. The need for high-detail depth estimation is a long-standing problem, already tackled here with CNNs and more recently in DepthPro. In the latter, the authors tried to mitigate the global-attention issue by using an "explicit multi-scale" ViT. While that is a good way of tackling the issue, it may come with other downsides (already present). So in general, when image sizes are that large, i.e. 2K or 4K, a ViT is probably not the best option, and it may make sense to revert to something like what I linked above.
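As a rough back-of-envelope for why 1080p is problematic: with patch size 14 the token count grows linearly with pixel count, and global attention grows quadratically with the token count. A small sketch, just arithmetic and not UniDepth code, ignoring flash-attention-style kernels that avoid materializing the full matrix:

```python
# Back-of-envelope: ViT token count and global-attention cost vs. input size.
# Patch size 14; the attention score matrix scales as tokens^2 (per head, per layer).
for w, h in [(644, 462), (924, 518), (1918, 1078), (3836, 2156)]:
    tokens = (w // 14) * (h // 14)
    attn_gib = tokens ** 2 * 2 / 1024 ** 3  # fp16 attention matrix, single head/layer
    print(f"{w}x{h}: {tokens} tokens, ~{attn_gib:.2f} GiB per attention matrix")
```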