
Training speed

Open yongmayer opened this issue 1 year ago • 2 comments

Hello,

Thank you so much for sharing this amazing work!

I am trying to train the model on the NYU dataset. The paper reports about 21 minutes per epoch on 8 A100 GPUs. I am using a single A100 GPU with batch size 32, and in my case training seems to get stuck at the following step forever... Meanwhile, I can run the evaluation without issue. I don't know what the problem could be, and I would appreciate any help or hints!

    with torch.no_grad():
        # convert the input image to latent space and scale.
        latents = self.encoder_vq.encode(x).mode().detach() * self.config.model.params.scale_factor
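(For context, this step encodes the image with the first-stage autoencoder, takes the mode of the latent distribution instead of sampling from it, and scales the result by the model's scale factor. A rough stand-in, assuming the diffusers AutoencoderKL behaves like encoder_vq here, which is not the exact EcoDepth code:)

    # Hedged stand-in for the quoted step: diffusers' AutoencoderKL assumed in
    # place of encoder_vq; the real EcoDepth code uses the LDM first-stage model.
    import torch
    from diffusers import AutoencoderKL

    device = "cuda" if torch.cuda.is_available() else "cpu"
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device).eval()

    x = torch.randn(4, 3, 480, 640, device=device)  # dummy batch of NYU-sized images
    scale_factor = 0.18215                          # standard SD latent scale factor

    with torch.no_grad():
        # encode once, take the mode of the latent distribution (no sampling), scale
        latents = vae.encode(x).latent_dist.mode() * scale_factor

    print(latents.shape)  # torch.Size([4, 4, 60, 80]): 8x downsampling, 4 latent channels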

P.S. The evaluation results match the paper well except for sq_rel.

    d1         d2         d3    abs_rel     sq_rel       rmse   rmse_log      log10      silog 
0.9776     0.9973     0.9995     0.0599     0.0194     0.2187     0.0773     0.0259     5.7549 

Again, thanks for the great work!

yongmayer avatar May 14 '24 02:05 yongmayer

Hi @yongmayer, thanks for appreciating our work. We used a per-device batch size of 4, resulting in a total batch size of 32 across 8 GPUs. The speed issue is probably because you are using a per-device batch size of 32 instead of 4. Could you try once with a batch size of 4 (with a single GPU, i.e. your current setup) and let me know if it works?
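
For reference, here is a rough sketch of how the per-device and effective batch sizes relate under distributed data parallel training (toy code with assumed names, not our actual training script):

    # Hedged sketch (assumed names, not the actual EcoDepth training script):
    # under DistributedDataParallel the DataLoader batch size is per process,
    # so the effective (global) batch size is per_device_batch * num_gpus.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    per_device_batch = 4                         # paper setting, per GPU
    num_gpus = 8                                 # 8 x A100 in the paper
    global_batch = per_device_batch * num_gpus   # = 32

    dataset = TensorDataset(torch.randn(64, 3, 480, 640))  # dummy NYU-sized images
    sampler = DistributedSampler(dataset, num_replicas=num_gpus, rank=0)
    loader = DataLoader(dataset, batch_size=per_device_batch, sampler=sampler)

    print(f"per-device batch: {per_device_batch}, effective batch: {global_batch}")

On a single GPU there is only one process, so a DataLoader batch size of 4 is also the effective batch size.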

Aradhye2002 avatar May 17 '24 13:05 Aradhye2002

Hi @Aradhye2002, Thank you so much! That works!

I have another question, if you don't mind me asking. How should I understand the diffusion process in EcoDepth? From line 96 in EcoDepth/depth/models/model.py (EcoDepthEncoder.forward), I see that it uses the UNet from Stable Diffusion, but I cannot see the forward diffusion process. Am I understanding it wrong? I am new to diffusion-based depth estimation, and I would greatly appreciate your explanation!

Again, thank you!

yongmayer avatar May 20 '24 03:05 yongmayer

Hi @yongmayer. While we do use the Stable Diffusion backbone, we do not use it as a diffusion model per se. Instead, we use the UNet (the backbone) purely as a feature extractor and use the resulting hierarchical feature maps to predict the final depth map. Please refer to the model architecture in the paper for more details.
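
A rough, self-contained sketch of that pattern (toy modules with made-up names, not our actual code or the Stable Diffusion UNet) looks like this:

    # Hedged sketch of the "frozen backbone as feature extractor" pattern:
    # run one forward pass, grab hierarchical feature maps with hooks, and
    # decode them into a depth map. No noising/denoising loop is involved.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyBackbone(nn.Module):
        """Stand-in for the frozen UNet backbone: produces multi-scale features."""
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)    # 1/2 resolution
            self.stage2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)   # 1/4 resolution
            self.stage3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)  # 1/8 resolution

        def forward(self, x):
            x = F.relu(self.stage1(x))
            x = F.relu(self.stage2(x))
            return F.relu(self.stage3(x))

    class DepthHead(nn.Module):
        """Fuses the hooked feature maps and predicts a single-channel depth map."""
        def __init__(self, channels=(32, 64, 128), out_size=(128, 160)):
            super().__init__()
            self.out_size = out_size
            self.fuse = nn.Conv2d(sum(channels), 64, 3, padding=1)
            self.pred = nn.Conv2d(64, 1, 1)

        def forward(self, feats):
            feats = [F.interpolate(f, size=self.out_size, mode="bilinear",
                                   align_corners=False) for f in feats]
            x = F.relu(self.fuse(torch.cat(feats, dim=1)))
            return self.pred(x)

    backbone, head = ToyBackbone(), DepthHead()
    features = []
    for layer in (backbone.stage1, backbone.stage2, backbone.stage3):
        layer.register_forward_hook(lambda m, i, o, store=features: store.append(o))

    image = torch.randn(1, 3, 128, 160)   # dummy RGB input
    with torch.no_grad():                 # backbone is used purely as a feature extractor
        backbone(image)
    depth = head(features)                # (1, 1, 128, 160) depth prediction
    print(depth.shape)

The real model runs the pretrained Stable Diffusion UNet once per image to obtain those feature maps; there is no iterative noising or denoising.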

Aradhye2002 avatar Mar 21 '25 12:03 Aradhye2002