MiDaS
Temporal consistency
Hi, I really love your models and they are extremely helpful. However, when I apply the model to a video, the scaling is very inconsistent. Does anyone have tips on how to improve the temporal consistency when applying the model in a live setting? It does not have to be an ML kind of solution. Also, is there a way to constrain the depth maps? For example, the depth map would only show things up to 10 cm away, and everything beyond that would just be black. Thanks in advance.
I'd like to improve the temporal consistency too. https://youtu.be/z6fK-kdMZNQ
I guess the depth maps can be "normalized" by some post-processing, and the same applies to your request of constraining the max depth, but it could be rather tricky for a moving camera and it could be too slow for a "live setting"...
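The constraining part could look roughly like the sketch below. It assumes the output has already been converted to (approximately) metric depth; MiDaS itself only gives relative inverse depth, so that calibration step is an assumption of mine, as are the function and parameter names.

import numpy as np

def mask_beyond(depth_m, max_depth_m=0.10):
    """Zero out everything farther than max_depth_m (in metres).

    depth_m is assumed to already be metric depth; MiDaS outputs relative
    inverse depth, so a scene-specific scale/shift calibration (not shown
    here) would be needed first.
    """
    out = depth_m.copy()
    out[out > max_depth_m] = 0.0  # render far regions as black
    return out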
Um, I see the values from the .png can't really be consistent, because they are "normalized" to fill the whole range of the output (e.g. 0-255):
out = max_val * (depth - depth_min) / (depth_max - depth_min)
I suppose the right way to keep the results consistent across different frames is to use the values from the PFMs: analyze all files to find the global min and max, and use those values to convert all PFMs to PNGs.
The hacky way could be looking into a few PFMs to see the typical range of the results, making depth_min and depth_max constants, and hoping for the best... =}
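A rough sketch of the two-pass variant, assuming a read_pfm(path) helper that returns the float depth array plus a scale (MiDaS ships one in utils.py); the file paths, 16-bit output and loop structure here are my assumptions:

import glob
import numpy as np
import cv2
from utils import read_pfm  # PFM reader returning (data, scale)

pfm_files = sorted(glob.glob("output/*.pfm"))

# Pass 1: find the global range over the whole sequence.
depth_min, depth_max = np.inf, -np.inf
for f in pfm_files:
    depth, _ = read_pfm(f)
    depth_min = min(depth_min, depth.min())
    depth_max = max(depth_max, depth.max())

# Pass 2: write PNGs scaled with the *same* global range
# (or hard-code depth_min/depth_max constants for the "hacky" live variant).
max_val = 2 ** 16 - 1
for f in pfm_files:
    depth, _ = read_pfm(f)
    out = max_val * (depth - depth_min) / (depth_max - depth_min)
    cv2.imwrite(f.replace(".pfm", ".png"), out.astype(np.uint16))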
Any pointers here would be appreciated :) I am unable to analyze all files, as I am using this on a robot that operates in the real world. The hacky way also doesn't work because the scaling is inconsistent over time.
Well, I removed the relative normalization and used this:
# `depth` is the raw model output, `bits` the PNG byte depth (1 = 8-bit, 2 = 16-bit)
depth_min = 0       # fixed range instead of the per-frame (relative) min/max
depth_max = 12000
max_val = (2 ** (8 * bits)) - 1
np.clip(depth, depth_min, depth_max, out=depth)  # clamp in place before scaling
Of course, it did not help much. Sometimes MiDaS really surprises me by how different its results are for very similar input frames.
https://youtu.be/81ScNArJ-fE
Actually, the output frames are so varied that I could not get reasonable results even with area-based normalization.
My guess is that the depth scale is somewhat arbitrary by the nature of the problem. To keep it consistent, data from multiple frames should be used, like evaluating the network on the current image together with the last image's depth map or some latent state of the last image(s).
Other projects either graft some magic around that by using optical flow for camera pose estimation, or re-train the network with temporal consistency rewards. But this means either retraining the network during the actual evaluation of an individual movie, or having a network that still just works on single frames, only with better average consistency. Which may easily fail again, of course.
Also, a secondary network could be trained to filter the final depth images, e.g. the last three depth images in, one normalized image out. Like temporal super-resolution.
That could easily be trained on movies with depth ground-truth data.
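For a non-learned, live-capable variant of the multi-frame idea, one option is to align each new (relative) depth map to the running estimate with a least-squares scale and shift, and then smooth with an exponential moving average. A minimal sketch; the class, the parameter names and the EMA weight are mine, not anything from MiDaS:

import numpy as np

class TemporalDepthStabilizer:
    """Scale/shift-align each frame to the running estimate, then blend."""

    def __init__(self, alpha=0.85):
        self.alpha = alpha  # EMA weight for the previous state
        self.state = None   # smoothed depth from earlier frames

    def __call__(self, depth):
        depth = depth.astype(np.float64)
        if self.state is None:
            self.state = depth
            return depth

        # Least-squares fit of s * depth + t to the previous state,
        # so the arbitrary per-frame scale does not cause flicker.
        # (For a live setting, the fit can be done on a subsampled grid.)
        x = depth.ravel()
        y = self.state.ravel()
        A = np.stack([x, np.ones_like(x)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
        aligned = s * depth + t

        # Exponential moving average of the aligned depth.
        self.state = self.alpha * self.state + (1 - self.alpha) * aligned
        return self.state

This only damps the flicker coming from the arbitrary per-frame scale; it adds no real multi-frame information, so moving objects will lag slightly behind the smoothed output.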
Thanks for all the replies. If anyone has a suggestion for depth estimation/prediction networks that can be easily trained without supervision, that would be greatly appreciated.
@ReiniertB We have developed a video depth estimation model ViTA based on MiDaS 3.0. Hope this can help you!
I wonder why you stuck with MiDaS 3.0? What is wrong with 3.1 for you?
Because our paper was submitted last year, and at that time we could only use MiDaS 3.0. Of course, we would like to train a 3.1 version.
I see. =) I got confused by this:
[08/2023] Initial release of inference code and models.
Of course, we would like to train a 3.1 version.
"Would like" does mean you are planning doing it soon or is it more of a theoretical option? =}
I will release the 3.1 version once the models are trained.
@ReiniertB @vitacon Our work Neural Video Depth Stabilizer (NVDS) has been accepted at ICCV 2023. NVDS can stabilize any single-image depth predictor in a plug-and-play manner, without additional training or any extra effort. We have tried NVDS with MiDaS, DPT, MiDaS 3.1, and NewCRFs, and the results are quite satisfactory. You can simply change the depth predictor to MiDaS 3.1 (only adjusting one line in our demo code), and NVDS can produce a significant improvement in temporal consistency.
@KexianHust Hi, I'm really interested in your work. It seems you haven't made your paper public yet. Could you share a link to it?