nested-transformer
Training hours & ImageNet accuracy
Hello, thanks for sharing your interesting work.
I was trying to reproduce the NesT-T ImageNet result in this link using TPUs.
Here is my TPU-v3 (8 cores) result (link), using exactly the same hyperparameters as in imagenet_nest_tiny.py.
As you can see, training takes 63 hours, while your run takes 21 hours. How can I reduce the training time to match your result? If this difference comes from data loading time, could you tell me what type of data storage you used? Right now, I'm using a Google Cloud Storage bucket.
Furthermore, I see an accuracy difference of about 0.5% (81.0 vs. 81.5). Could you explain this difference?
Hi, is it possible that your data disk is located far from your machine, so the latency comes from the data pipeline?
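One quick way to check whether the input pipeline is the bottleneck is to time how fast batches come out of the dataset iterator on its own, without any model compute. This is not from the repo, just a generic sketch: `measure_throughput` and the fake iterator are hypothetical names, and in practice you would pass the iterator of your real tf.data pipeline reading from the GCS bucket.

```python
import time
from itertools import islice

def measure_throughput(batches, num_batches=100):
    # Time iterating `num_batches` batches from an input pipeline
    # (e.g. a tf.data dataset reading from a GCS bucket) to see
    # whether data loading, not compute, limits step time.
    it = iter(batches)
    next(it)  # warm-up: exclude one-time setup cost
    n = sum(1 for _ in islice(it, num_batches))
    start = time.perf_counter()
    # A second timed pass over the remaining batches would be more
    # robust; here we time a single pass for brevity.
    n = sum(1 for _ in islice(it, num_batches))
    elapsed = time.perf_counter() - start
    return n / elapsed  # batches per second

# Toy stand-in for a real dataset iterator.
fake_batches = (list(range(256)) for _ in range(300))
print(f"{measure_throughput(fake_batches):.1f} batches/sec")
```

If the measured batches/sec is well below what one training step needs, the latency is in the data pipeline (e.g. bucket locality), not the TPU.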
We have trained the tiny version a few times; there seems to be some variance, around 0.3%. Your l2_grads looks much higher than mine, which is suspicious to me. Can you look into it from this angle?
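For reference, the l2_grads metric is typically the global l2 norm over all gradient arrays. A minimal JAX sketch of how such a value can be computed (the function name `global_l2_norm` is mine; the repo's exact logging code may differ):

```python
import jax
import jax.numpy as jnp

def global_l2_norm(grads):
    # Global l2 norm across a pytree of gradient arrays,
    # comparable to an l2_grads metric logged during training.
    leaves = jax.tree_util.tree_leaves(grads)
    return jnp.sqrt(sum(jnp.sum(jnp.square(g)) for g in leaves))

# Toy example: gradients of a simple quadratic loss.
params = {"w": jnp.array([3.0, 4.0]), "b": jnp.array(0.0)}
loss = lambda p: 0.5 * jnp.sum(p["w"] ** 2) + 0.5 * p["b"] ** 2
grads = jax.grad(loss)(params)
print(float(global_l2_norm(grads)))  # grads are w and b themselves, so norm = 5.0
```

A much higher l2_grads than a reference run can point to differences in batch size, learning-rate schedule, or data preprocessing rather than random seed variance.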
Thanks for the quick answer. I'll check your point about l2_grads. One more thing: did you use TPU Pod slices (e.g., 8x4 or 8x8) when training NesT-tiny, or just a single TPU?
I am using a 2x2 TPU for NesT-tiny (I think that is called a single TPU).