# Monocular Depth Estimation Rankings and 2D to 3D Video Conversion Rankings

Rankings include: Depth Anything, DPT, FutureDepth, GBDMF, GenPercept, LeReS, LightedDepth, LFVRT, Marigold, Metric3D, MiDaS, NeWCRFs, PatchFusion, UniDepth, and ZoeDepth.
## List of Rankings
Each ranking includes only the best model for one method.
### Monocular Depth Estimation Rankings
- DA-2K (mostly 1500×2000): Acc (%) >= 86
- UnrealStereo4K (3840×2160): AbsRel <= 0.04
- MVS-Synth (1920×1080): AbsRel <= 0.06
- HRSD (1920×1080): AbsRel <= 0.08
- Middlebury2021 (1920×1080): SqRel <= 0.5
- NYU-Depth V2 (640×480): OPW <= 0.31
- NYU-Depth V2 (640×480): AbsRel <= 0.058
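For reference, AbsRel and SqRel are the standard relative depth-error metrics used in the tables below. A minimal sketch of how they are commonly computed (function names and the valid-depth mask are my own, not from any specific benchmark's evaluation code):

```python
import numpy as np

def absrel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mask = gt > 0  # assumption: pixels with non-positive depth have no ground truth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def sqrel(pred, gt):
    """Mean squared relative error: mean((pred - gt)^2 / gt) over valid pixels."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mask = gt > 0
    return float(np.mean((pred[mask] - gt[mask]) ** 2 / gt[mask]))
```

Lower is better for both; SqRel penalizes large outlier errors more heavily than AbsRel, which is why it is the stricter choice for high-depth-range scenes such as Middlebury2021.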
### 2D to 3D Video Conversion Rankings
#### I. Video Inpainting Rankings
- (to do)
#### II. Light Field Video Reconstruction from Monocular Video Rankings
- :crown: 4DLFVD with up to 10×10 real light field views ✔️: LPIPS 😍 (no data)
  This will be the King of all rankings. We look forward to ambitious researchers.
- 4DLFVD with up to 10×10 real light field views ✔️: PSNR 😞 (no data)
- Hybrid with 7×7 synthetic light field views ✖️: LPIPS 😍 (no data)
- Hybrid with 7×7 synthetic light field views ✖️: PSNR 😞 >= 32 dB
## Appendices
- Appendix 1: Rules for qualifying models for the rankings (to do)
- Appendix 2: Metrics selection for the rankings (to do)
- Appendix 3: List of all research papers from the above rankings
## DA-2K (mostly 1500×2000): Acc (%) >= 86
## UnrealStereo4K (3840×2160): AbsRel <= 0.04

| RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|---|---|---|---|---|---|---|
| 1 | ZoeDepth +PFR=128 ENH: | 0.0388 {1} | ENH: UnrealStereo4K | ENH: | - | - |
## MVS-Synth (1920×1080): AbsRel <= 0.06

| RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|---|---|---|---|---|---|---|
| 1 | ZoeDepth +PFR=128 ENH: | 0.0589 {1} | ENH: MVS-Synth | ENH: | - | - |
## HRSD (1920×1080): AbsRel <= 0.08

| RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|---|---|---|---|---|---|---|
| 1 | DPT-B + R + AL ENH: | 0.074 {1} | ENH: HRSD | ENH: | - | - |
## Middlebury2021 (1920×1080): SqRel <= 0.5

| RK | Model | SqRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|---|---|---|---|---|---|---|
| 1 | LeReS-GBDMF ENH: | 0.444 {1} | ENH: HR-WSI | ENH: | - | - |
## NYU-Depth V2 (640×480): OPW <= 0.31

| RK | Model | OPW ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|---|---|---|---|---|---|---|
| 1 | FutureDepth Backbone: Swin-L | 0.303 {4} | NYU-Depth V2 | - | - | - |
## NYU-Depth V2 (640×480): AbsRel <= 0.058
## Hybrid with 7×7 synthetic light field views ✖️: PSNR 😞 >= 32 dB

| RK | Model | PSNR ↑ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|---|---|---|---|---|---|---|
| 1 | LFVRT MDE: DPT Backbone: ViT | 32.66 {3+1D} | GoPro & TAMULF | MDE: | - | - |
📝 Note: This ranking includes only one model; the other methods are image-based and lack temporal information, which makes them unsuitable for light field video reconstruction from monocular video.
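The PSNR figures above (higher is better, reported in dB) follow the standard definition, 10·log10(MAX² / MSE), where MAX is the peak signal value. A minimal sketch (function name and the `max_val` parameter are my own; benchmarks may average PSNR per-view or per-frame differently):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE).

    Assumes both images share the same value range [0, max_val]
    (use max_val=255 for 8-bit images).
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mse = float(np.mean((pred - gt) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

LPIPS, by contrast, is a learned perceptual distance (lower is better) and requires a pretrained network to evaluate, which is why the two metrics are listed separately.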
## Appendix 3: List of all research papers from the above rankings