Training time and implementation details

MLDeS opened this issue 2 years ago · 1 comment

How long did it take to train the TAPIR model on 64 TPU-v3 cores on the complete MOVi-E dataset? How would that time be expected to scale to 4 A100 80GB or 40GB GPUs? Also, I assume the 50,000 training steps refers to the number of gradient updates? Approximately how many epochs would that correspond to?

Edit: I see that the batch size is 8 and the dataset size is around 10k, so would that be close to ~40 epochs?

MLDeS avatar Sep 14 '23 07:09 MLDeS

It's a bit difficult to define "epochs", since we sample different points on every step. Our internal dataset is more like 100K videos, and the batch size is 8 per device, so you need to multiply the batch size by 64 (an effective batch of 512).
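For concreteness, here is a rough back-of-the-envelope sketch of the epoch count, using only the numbers mentioned above (50,000 gradient updates, batch size 8 per device on 64 devices, roughly 100K videos):

```python
# Rough epoch estimate from the numbers discussed in this thread.
steps = 50_000            # gradient updates
per_device_batch = 8      # batch size per device
num_devices = 64          # TPU-v3 cores
dataset_size = 100_000    # approximate internal dataset size

effective_batch = per_device_batch * num_devices   # 512 videos per step
samples_seen = steps * effective_batch             # 25.6M video samples
epochs = samples_seen / dataset_size               # ~256 passes over the data

# Note: "epoch" is fuzzy here, since different points are sampled
# from each video on every pass.
print(f"effective batch: {effective_batch}, approx epochs: {epochs:.0f}")
```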

Overall, training finishes in about 3 days. For what it's worth, we suspect it would be more efficient to train TAPIR on NVIDIA hardware, since TAPIR has lots of gather operations, and gathers are much faster on GPUs than on TPUs. However, we don't have access to larger multi-GPU setups internally, so for us it's still faster to use TPUs.
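As an illustration of the kind of gather in question (this is not TAPIR's actual code; the shapes and names are made up), sampling per-point features from a feature map lowers to gather ops in JAX:

```python
import jax
import jax.numpy as jnp

# Illustrative only: sample features at a few query locations from a
# feature map, the sort of gather-heavy operation described above.
key = jax.random.PRNGKey(0)
features = jax.random.normal(key, (64, 64, 256))  # H x W x C feature map
ys = jnp.array([3, 17, 42])                       # query row indices
xs = jnp.array([5, 30, 60])                       # query column indices

# Integer-array indexing compiles to gather ops, which are comparatively
# cheap on GPUs and slower on TPUs.
point_features = features[ys, xs]                 # shape (3, 256)
print(point_features.shape)
```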

cdoersch avatar Sep 15 '23 09:09 cdoersch