Training on Kubric Dataset
I am trying to train the TAPIR model on the Kubric dataset in Google Colab, but my code keeps stopping without any errors. I run `python ./experiment.py --config ./configs/tapir_config.py`, and the config file loads successfully, but training then stops abruptly with no error message. I am unable to determine the cause and would be really grateful for any help.
Thank You!
Apologies for the slow response; it's likely that this is just compilation time (the training graph is complex and the JAX GPU compiler is slow; it might take hours to compile), but it's somewhat time-consuming for us to debug this so we haven't dug into it yet. Hopefully we will find time to do so soon.
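One way to tell a long compile apart from a dead process is JAX's compile-logging flag. The sketch below wraps it in a helper (the `jax_log_compiles` option is a standard JAX config flag, though its exact behavior can vary across JAX versions; the import guard is just so the snippet is safe to run anywhere):

```python
def enable_compile_logging():
    """Turn on per-compilation logging in JAX, if JAX is installed.

    With this flag set, each XLA compilation emits a log line, so a long
    silence accompanied by compile messages means the compiler is still
    working rather than the process having died.
    Returns True if the flag was set, False if JAX is unavailable.
    """
    try:
        import jax
        jax.config.update("jax_log_compiles", True)
        return True
    except ImportError:
        return False
```

Calling `enable_compile_logging()` before the training loop starts should make subsequent compiles visible in the logs.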
We attempted to reproduce your reported issue, and here is what we found: it took approximately 40 minutes to see the first training log result. (Also, the codebase trains on CPU by default, which is very slow.) We have not yet optimized the experience of training locally (if that is what you are doing).
Also, could you check `nvidia-smi` to see whether your model is being built and trained on the GPU?
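Besides `nvidia-smi`, you can ask JAX directly which backend it will use. A minimal sketch (the import guard is only there so the snippet runs even where JAX is absent; `jax.default_backend()` and `jax.devices()` are standard JAX APIs):

```python
def report_devices():
    """Return a short description of the backend JAX will train on.

    If this reports 'cpu' while nvidia-smi shows a working GPU, the
    environment is likely missing a GPU-enabled jaxlib build, so the
    model would silently fall back to (very slow) CPU training.
    """
    try:
        import jax
        return f"backend={jax.default_backend()} devices={jax.devices()}"
    except ImportError:
        return "jax not installed"
```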
@yangyi02 Thank you for your response. I did enable GPU and the model was built on GPU as well, however the execution stops midway and training doesn't take place.
@TahaRazzaq From the screenshot, I don't see the training stopping.
Could you verify whether the training message has merely hung (if so, could you wait for, say, an hour?) or whether the process has indeed completely stopped?
You can set `batch_dim` in `tapir_config.py` to 1 for slightly faster verification.
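If the config is an `ml_collections` ConfigDict loaded through absl flags (as in Jaxline-style training scripts), the batch size can often be overridden from the command line instead of editing the file. A hypothetical sketch, assuming the field is actually named `batch_dim` at the top level of the config:

```shell
# Hypothetical override; adjust the flag path to match the
# actual field name and nesting in tapir_config.py.
python ./experiment.py \
  --config ./configs/tapir_config.py \
  --config.batch_dim=1
```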
@yangyi02 The execution does stop, since I'm able to run other cells afterwards. Even with `batch_dim` set to 1, the execution stops within 3-5 minutes. The last message printed is `Initializing Parameters`, after which it displays a few warnings and stops.