
training hparam clarifications

rwightman opened this issue 3 years ago · 3 comments

I've been attempting to train the 'tiny' model from scratch for verification, but I'm running into some problems. I'm hitting an accuracy wall of sorts approaching 80%, and the model should comfortably exceed that based on your results.

I am (obviously) familiar with the train scripts and hparams in general, and I have trained quite a few related models to their expected accuracy.

A few questions

  • What is your train batch size? The paper mentions a 'total batch size' of 128, which seems very low; when I read 'total' I assume it means global (the sum across all train nodes). You also use a 1e-3 learning rate, which is very high for a global batch of 128, so perhaps you meant that 128 is the local (per-GPU) batch size? (A quick scaling-rule sketch follows this list.)
  • If 128 is the local batch size, what is the global batch size? How many GPUs did you train with (N * 128)?
  • Are there any sensitivities, epsilons, etc. that you ran into during training, where I may not have noticed a deviation from the timm defaults?
  • Your train command indicates you used EMA weight averaging; did you compare against a run without it?
  • Your train command suggests you did NOT use AMP. Is that true? The accuracy / loss wall I'm hitting looks like it could be a floating-point precision issue.
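
For context on the batch-size/LR question above, here is a quick sketch of the common linear LR-scaling heuristic (learning rate proportional to global batch size). The 512-image reference batch and 5e-4 base LR are my own assumptions for the sake of the arithmetic, not values taken from the GC ViT paper or the timm defaults.

```python
# Illustrative only: the common linear LR-scaling heuristic, lr = base_lr * global_batch / ref_batch.
# The reference batch (512) and base LR (5e-4) are assumptions for the arithmetic, not GC ViT values.

def scaled_lr(global_batch: int, base_lr: float = 5e-4, ref_batch: int = 512) -> float:
    """Linear scaling rule: the learning rate grows proportionally with the global batch size."""
    return base_lr * global_batch / ref_batch

print(scaled_lr(128))   # 1.25e-04 -- far below 1e-3 if 128 really were the global batch
print(scaled_lr(1024))  # 1.00e-03 -- 1e-3 lines up with a global batch around 1024 under this rule
print(scaled_lr(4096))  # 4.00e-03
```

Under that rule, a 1e-3 learning rate fits a global batch in the low thousands far better than a global batch of 128, which is what motivates the question.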

rwightman · Aug 21 '22

Hi @rwightman

Thank you for the insightful comments/questions. For starters, our work is built on top of timm==0.5.4 (default settings, etc.). In addition, for all experiments we used 4 nodes (4 x 8 V100 = 32 GPUs). I'd like to provide more details/answers regarding the questions:

  • For GC ViT Tiny, the uploaded model weights/logs use a total batch size of 32 x 128 (N_gpus * batch_size_per_gpu) = 4096. However, we have also trained with a total batch size of 32 x 32 = 1024 and achieved very similar results (please see the table below). When using a local batch size of 128 (total 4096), we use a learning rate of 0.005 (as specified here). Otherwise, we use a learning rate of 0.001 when the local batch size is 32 (total 1024).
  • The global batch size is 32 x 128 = 4096 when the local batch size is 128. We used 32 GPUs as specified above.
  • We did not run into any sensitivities, epsilons, etc., and used all the defaults from timm==0.5.4. In fact, I have uploaded the entire config file, as generated by timm, in this link for a thorough overview of all hyper-parameters.
  • Yes. We achieve a slight improvement using EMA, and generally find EMA to be more useful. Results for experiments with and without EMA are listed below.
  • We actually did use AMP for all experiments, as indicated in the config file. But for clarity, I have also added --amp to the training commands. (A minimal sketch of how AMP and EMA fit into a timm-style training step follows the table below.)
| model   | top-1 | local batch size | global batch size | EMA | AMP |
|---------|-------|------------------|-------------------|-----|-----|
| GCViT-T | 83.40 | 128              | 4096              | Yes | Yes |
| GCViT-T | 83.38 | 128              | 4096              | No  | Yes |
| GCViT-T | 83.39 | 32               | 1024              | Yes | Yes |
| GCViT-T | 83.37 | 32               | 1024              | No  | Yes |
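
To make the AMP/EMA points above concrete, here is a minimal sketch of how mixed precision and EMA weight averaging typically slot into a timm-style training step. This is not our actual training script: the tiny stand-in model, the AdamW settings, and the 0.9998 EMA decay are illustrative placeholders.

```python
# Minimal sketch (not the actual training script): AMP + EMA in a single training step.
# Assumes timm>=0.5.4 and a CUDA device; model, optimizer settings, and EMA decay are placeholders.
import torch
import torch.nn as nn
from timm.utils import ModelEmaV2

# Tiny stand-in model and fake batch so the sketch runs anywhere; swap in GC ViT and a real loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

scaler = torch.cuda.amp.GradScaler()           # AMP loss scaling
model_ema = ModelEmaV2(model, decay=0.9998)    # EMA copy of the weights (decay value is illustrative)

images = torch.randn(8, 3, 224, 224, device='cuda')
targets = torch.randint(0, 1000, (8,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():                # mixed-precision forward pass and loss
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()                  # scaled backward to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()
model_ema.update(model)                        # the EMA weights are the ones evaluated/reported
```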

In addition to the above, we have also used the Swin Transformer epoch-based scheduler by slightly modifying timm's iteration-based scheduler (link here). Our motivation was to be comparable with the Swin training settings. We will update the arXiv manuscript to reflect this information very soon.
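
For readers unfamiliar with the distinction, the sketch below contrasts epoch-based stepping (Swin-style) with timm's default per-iteration stepping of the cosine scheduler. It is not our modified scheduler code; the 300 epochs, 20 warmup epochs, and 1000 updates per epoch are placeholder numbers.

```python
# Illustrative comparison of epoch-based vs. iteration-based LR scheduling with timm's cosine scheduler.
# Not the repo's modified code; the epoch/warmup/update counts are placeholders.
import torch
from timm.scheduler import CosineLRScheduler

model = torch.nn.Linear(10, 10)

# Epoch-based (Swin-style): the LR changes once per epoch.
opt_a = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched_epoch = CosineLRScheduler(opt_a, t_initial=300, warmup_t=20,
                                warmup_lr_init=1e-6, t_in_epochs=True)
for epoch in range(300):
    # ... train for one epoch ...
    sched_epoch.step(epoch + 1)

# Iteration-based (timm default): the LR changes after every optimizer update.
updates_per_epoch = 1000  # placeholder
opt_b = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched_iter = CosineLRScheduler(opt_b, t_initial=300 * updates_per_epoch,
                               warmup_t=20 * updates_per_epoch,
                               warmup_lr_init=1e-6, t_in_epochs=False)
num_updates = 0
for epoch in range(300):
    for _ in range(updates_per_epoch):
        # ... one optimizer step ...
        num_updates += 1
        sched_iter.step_update(num_updates)
```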

Given my previous experience, I believe the timm library is the most effective and efficient way to train on ImageNet, and an easy way to reach or surpass SOTA without needing to change much.

ahatamiz · Aug 22 '22

@ahatamiz thank you for the detailed response. My LR needs a bit of adjustment based on that info, so I'll try another run with that and a new seed. I noticed the scheduler change; for long training runs I've found it makes very little difference (which is why I have been slow to support per-step updates)...

rwightman · Aug 22 '22

Hi @rwightman

Sure. I totally agree that the scheduler would not make a big difference. Looking forward to knowing the results, and I'd be happy to provide more details if needed.

Thanks.

ahatamiz · Aug 22 '22