Training speed
Hello,
Thank you so much for sharing this amazing work!
I am trying to train the model on the NYU dataset. The paper reports about 21 minutes per epoch on 8 A100 GPUs. I am using a single A100 GPU with batch size 32, and in my case training seems stuck forever at the following step. Meanwhile, I can run the evaluation without issue. I don't know what the problem could be, and I would appreciate any help/hints!
with torch.no_grad():
    # Convert the input image to latent space and scale it.
    latents = self.encoder_vq.encode(x).mode().detach() * self.config.model.params.scale_factor
P.S. The evaluation results match the paper well except for sq_rel.
d1      d2      d3      abs_rel  sq_rel  rmse    rmse_log  log10   silog
0.9776  0.9973  0.9995  0.0599   0.0194  0.2187  0.0773    0.0259  5.7549
Again, thanks for the great work!
Hi @yongmayer, thanks for appreciating our work. We used a per-device batch size of 4, which gives a total batch size of 32 across 8 GPUs. The speed issue is probably because you are using a per-device batch size of 32 instead of 4. Could you try once with a batch size of 4 (with a single GPU, i.e., your current setup) and let me know if it works?
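To make the relationship concrete, here is a minimal sketch of the batch-size arithmetic (the dataset and DataLoader arguments are illustrative placeholders, not the repo's actual training script):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the NYU training set (shapes are arbitrary).
train_dataset = TensorDataset(torch.randn(128, 3, 480, 640),
                              torch.randn(128, 1, 480, 640))

per_device_batch_size = 4   # what each GPU sees
world_size = 8              # number of GPUs in the paper's setup
effective_batch_size = per_device_batch_size * world_size  # 4 * 8 = 32

# On a single GPU (world_size = 1), keep the per-device value at 4;
# passing 32 here multiplies per-step memory and compute roughly 8x.
loader = DataLoader(train_dataset,
                    batch_size=per_device_batch_size,
                    shuffle=True, num_workers=4, pin_memory=True)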
Hi @Aradhye2002, Thank you so much! That works!
I have another question, if you don't mind me asking. How should I understand the diffusion process in EcoDepth? From line 96 in EcoDepth/depth/models/model.py (EcoDepthEncoder.forward), I see that it uses the UNet from Stable Diffusion, but I cannot see the forward diffusion process. Am I understanding it wrong? I am new to diffusion-based depth estimation, and I would appreciate your kind explanation a lot!
Again, thank you!
Hi @yongmayer. While we do use the Stable Diffusion backbone, we do not use it as a diffusion model per se. Instead, we use the UNet (the backbone) as a feature extractor and use the hierarchical feature maps it produces to predict the final depth map, so there is no forward (noising) or reverse (denoising) process involved. Please refer to our model architecture given in the paper for more details.
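If it helps, here is a minimal sketch of the general idea, not EcoDepth's actual code: a single forward pass through the backbone (no noising, no iterative denoising), with intermediate feature maps collected at several scales for a depth head to fuse. The toy CNN and hook-based extraction below are illustrative stand-ins.

import torch
import torch.nn as nn

def extract_hierarchical_features(backbone, x, block_names):
    # Collect intermediate outputs of the named blocks during one
    # forward pass; no diffusion timesteps are iterated.
    feats, handles = [], []
    for name, module in backbone.named_modules():
        if name in block_names:
            handles.append(module.register_forward_hook(
                lambda _mod, _inp, out: feats.append(out)))
    with torch.no_grad():
        backbone(x)
    for h in handles:
        h.remove()
    return feats  # multi-scale maps, later fused to predict depth

# Toy CNN standing in for the Stable Diffusion UNet backbone.
toy_backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                             nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())
maps = extract_hierarchical_features(toy_backbone, torch.randn(1, 3, 64, 64), ["0", "2"])
print([m.shape for m in maps])  # two scales: [1, 8, 32, 32] and [1, 16, 16, 16]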