
Paper's experimental details


I'd like to draw your attention to the following points:

  1. The MOT17 dataset in the training code has a length of 632040.

  2. I am training on MOT17 with this code on a node of 8 V100 GPUs with NVIDIA Apex, and I cannot fit a batch size larger than 2 on a single GPU, so the node effectively processes 16 data points per iteration.

  3. Given point 1, finishing one epoch takes 632040/16 iterations, i.e. ~39,500 steps per epoch.

  4. With this node configuration I am measuring a training speed of 4.44 seconds per step, so one epoch would take roughly 39,500 × 4.44 s ≈ 49 hours, i.e. about 2 days to finish (see the sketch below).
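For reference, here is the back-of-the-envelope calculation behind points 1–4 as a small Python sketch (the 632040 dataset length and the 4.44 s/step speed are my observations on this node; the rest follows arithmetically):

```python
# Epoch-time estimate for MOT17 training on one 8x V100 node
# (batch size 2 per GPU, measured 4.44 seconds per step).

dataset_len = 632040        # length of the MOT17 dataset in the training code
gpus_per_node = 8
batch_per_gpu = 2           # largest batch that fits on one V100 with Apex
sec_per_step = 4.44         # measured training speed on this node

effective_batch = gpus_per_node * batch_per_gpu      # 16 samples per iteration
steps_per_epoch = dataset_len / effective_batch      # ~39,500 steps
epoch_hours = steps_per_epoch * sec_per_step / 3600  # ~48.7 hours (~2 days)

print(f"steps/epoch: {steps_per_epoch:.0f}")  # -> 39502
print(f"hours/epoch: {epoch_hours:.1f}")      # -> 48.7
```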

My questions are:

a) The authors mention using a batch size of 32. How many GPUs or nodes were used to accommodate this batch size?

b) How long did it take to train the model on MOT17 and JTA?

c) Are my estimates close to what the authors experienced?

ujjwal-ai, Aug 18 '20