nested-transformer
Training hours & ImageNet accuracy
Hello, thanks for sharing your interesting work.
I was trying to reproduce the NesT-T ImageNet result in this link using TPUs.
Here is my TPU-v3 (8 cores) result (link), using exactly the same hyperparameters as in imagenet_nest_tiny.py.
As you can see, training takes 63 hours, while your run takes 21 hours. How can I reduce the training time to match your result? If this difference comes from data loading time, could you tell me what type of data storage you used? Right now, I'm using a Google Cloud Storage bucket.
Furthermore, I see an accuracy difference of about 0.5% (81.0 vs. 81.5). Could you explain this difference?
Hi, is it possible that your data disk is located far from your machine, so the latency comes from the data pipeline?
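One quick way to check whether the input pipeline is the bottleneck is to time how fast batches come out of the dataset iterator on its own, without any model compute. This is not from the repo, just a generic sketch: `measure_throughput` and the fake iterator are hypothetical names, and in practice you would pass the iterator of your real tf.data pipeline reading from the GCS bucket.

```python
import time
from itertools import islice

def measure_throughput(batches, num_batches=100):
    # Time iterating `num_batches` batches from an input pipeline
    # (e.g. a tf.data dataset reading from a GCS bucket) to see
    # whether data loading, not compute, limits step time.
    it = iter(batches)
    next(it)  # warm-up: exclude one-time setup cost
    n = sum(1 for _ in islice(it, num_batches))
    start = time.perf_counter()
    # A second timed pass over the remaining batches would be more
    # robust; here we time a single pass for brevity.
    n = sum(1 for _ in islice(it, num_batches))
    elapsed = time.perf_counter() - start
    return n / elapsed  # batches per second

# Toy stand-in for a real dataset iterator.
fake_batches = (list(range(256)) for _ in range(300))
print(f"{measure_throughput(fake_batches):.1f} batches/sec")
```

If the measured batches/sec is well below what one training step needs, the latency is in the data pipeline (e.g. bucket locality), not the TPU.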
We have trained the tiny version a few times; there seems to be some variance, around 0.3%. Your l2_grads looks much higher than mine, which is suspicious to me. Can you look into it from this angle?
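For reference, the l2_grads metric is typically the global l2 norm over all gradient arrays. A minimal JAX sketch of how such a value can be computed (the function name `global_l2_norm` is mine; the repo's exact logging code may differ):

```python
import jax
import jax.numpy as jnp

def global_l2_norm(grads):
    # Global l2 norm across a pytree of gradient arrays,
    # comparable to an l2_grads metric logged during training.
    leaves = jax.tree_util.tree_leaves(grads)
    return jnp.sqrt(sum(jnp.sum(jnp.square(g)) for g in leaves))

# Toy example: gradients of a simple quadratic loss.
params = {"w": jnp.array([3.0, 4.0]), "b": jnp.array(0.0)}
loss = lambda p: 0.5 * jnp.sum(p["w"] ** 2) + 0.5 * p["b"] ** 2
grads = jax.grad(loss)(params)
print(float(global_l2_norm(grads)))  # grads are w and b themselves, so norm = 5.0
```

A much higher l2_grads than a reference run can point to differences in batch size, learning-rate schedule, or data preprocessing rather than random seed variance.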
Thanks for the quick answer. I'll check your point about l2_grads. One more thing: did you use TPU Pod slices (e.g., 8x4 or 8x8) when training NesT-tiny, or just a single TPU?
I am using a 2x2 TPU for NesT-tiny (I think that is called a single TPU).