
Questions Regarding Optical Flow Supervision and Potential Enhancements

Open linwk20 opened this issue 1 year ago • 3 comments

First of all, thank you for your impressive work. I’ve been searching for methods that can provide accurate dense depth maps (which is why I believe FlowMap is significantly superior to COLMAP). Using optical flow to fine-tune depth networks seems like a great idea. I have the following questions:

  • Why supervise depth with optical flow? Is it because optical flow typically offers higher accuracy and can provide subpixel reprojection errors for the depth network? I ask because the optical flow might itself be inaccurate, since it also comes from a DNN — so why not co-optimize it for the scene?
  • Potential improvement with a pretrained MVS model? If we use a pretrained large reconstruction model that takes multiple views as input as the depth estimator, is there a chance of significantly improving the final performance? Or do you think optical flow supervision is already a form of multi-view stereo (MVS), making a pretrained MVS model unnecessary?
  • Can increasing resolution and image count improve depth accuracy? The current training resolution and the number of supported images are limited by GPU memory. However, for models that use Layer Norm instead of Batch Norm, we can accumulate gradients to achieve an equivalent large batch size (for example, the ViT backbone used in Depth Anything V2 follows this approach). If we used this method to greatly increase resolution and image count, do you think it would improve the final depth accuracy?
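To make the third point concrete, here is a minimal sketch of the gradient-accumulation idea (illustrative only, not FlowMap's or Depth Anything's actual training loop): summing scaled gradients over several micro-batches is mathematically equivalent to one large batch for models whose normalization uses per-sample statistics (Layer Norm), whereas Batch Norm would compute different statistics per micro-batch.

```python
import torch

# Illustrative sketch: gradient accumulation over `accum_steps` micro-batches
# reproduces the gradient of one large batch. Valid with LayerNorm
# (per-sample statistics); BatchNorm would behave differently because its
# statistics depend on the micro-batch contents.
torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
accum_steps = 4
data = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(accum_steps)]

model.zero_grad()
for x, y in data:
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average over the big batch
accum_grad = model.weight.grad.clone()
# At this point one would call optimizer.step(), exactly as for a large batch.
```

The same trick applies to resolution: tiling a high-resolution image into crops and accumulating gradients trades memory for wall-clock time, as long as the loss decomposes per sample.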

These are just some speculations, and I look forward to your response. Your thoughts may help us design more reasonable experiments. Thank you!
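For readers unfamiliar with the supervision scheme being discussed: my understanding of the core idea (a hedged sketch, not FlowMap's exact formulation) is that a depth map plus a relative camera pose induces a flow field by reprojection, which can then be compared against an off-the-shelf optical flow estimate. The function names and the L1 loss below are illustrative.

```python
import torch

# Hedged sketch of flow-based depth supervision (my reading of the idea, not
# FlowMap's exact loss): depth + relative pose (R, t) + intrinsics K induce a
# flow field by back-projecting, transforming, and re-projecting each pixel.
def induced_flow(depth, K, R, t):
    """Flow (H, W, 2) implied by depth (H, W) under relative pose (R, t)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                      # back-project to rays
    pts = rays * depth[..., None]                           # 3D points in frame 1
    pts2 = pts @ R.T + t                                    # move into frame 2
    proj = pts2 @ K.T                                       # project into frame 2
    uv2 = proj[..., :2] / proj[..., 2:3]                    # perspective divide
    return uv2 - pix[..., :2]                               # induced flow

# Comparing induced flow to an external flow estimate yields a depth loss;
# the subpixel precision of the flow is what makes this supervision dense.
def flow_loss(depth, K, R, t, flow_est):
    return (induced_flow(depth, K, R, t) - flow_est).abs().mean()
```

A sanity check of the geometry: with the identity pose (no camera motion) the induced flow is zero everywhere, regardless of the depth values.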

linwk20 avatar Sep 19 '24 18:09 linwk20


Hi, I am currently looking for an MVS model that has been trained on a large-scale dataset. Do you have any recommendations?

booker-max avatar Nov 21 '24 11:11 booker-max

There are plenty of such MVS models. One I know of that runs in real time is Spann3r (https://hengyiwang.github.io/projects/spanner), but it is more of an academic project and is not trained on a large-scale dataset.

linwk20 avatar Nov 21 '24 14:11 linwk20

I think the optical flow error may be alleviated by the weight maps predicted by the FlowMap model. Co-training flow and depth is possible; I think it is a direction worth exploring, but the key issue may be the data resources.
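The weight-map idea above can be sketched as a confidence-weighted flow loss (illustrative only, not FlowMap's exact formulation): a predicted per-pixel confidence downweights pixels where the flow supervision is likely wrong, such as occlusions or dynamic objects.

```python
import torch

# Illustrative sketch: per-pixel confidence in [0, 1] downweights unreliable
# flow supervision (occlusions, dynamic objects) instead of trusting the
# flow network everywhere.
def weighted_flow_loss(flow_induced, flow_est, confidence):
    """L1 flow error (H, W, 2 inputs), averaged with per-pixel confidence."""
    err = (flow_induced - flow_est).abs().sum(dim=-1)  # per-pixel L1 error
    return (confidence * err).sum() / confidence.sum().clamp(min=1e-6)
```

With confidence set to zero at an outlier correspondence, that pixel contributes nothing to the loss, so a single bad flow vector no longer corrupts the depth gradient.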

Also, I've extended it to large-scale dataset training and dynamic scenes, which shows promising results. However, I only focus on hand-object-interaction videos (since I work on Embodied AI and robot learning research). The project is called UniHOI and the repository is https://github.com/michaelyuancb/unihoi. I'm still improving the model; the code and weights will be released in the future.

michaelyuancb avatar Nov 27 '24 06:11 michaelyuancb