SWA with distributed training
In the case of distributed training, e.g. DDP, each GPU only processes a minibatch, so the BN statistics computed on each GPU are different. When SWA is adopted, we need to run one more epoch for bn_update. In this epoch, should we use sync BN to average the BN statistics from all GPUs? And are there any other modifications we need to make for DDP training?
Hi @milliema, I'd say you should do the same thing that is normally done with the batchnorm statistics at the end of parallel training; I imagine you are syncing the statistics between the copies of the model? I personally did not look into distributed SWA much, but here is a potentially useful reference: https://openreview.net/forum?id=rygFWAEFwS.
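For what it's worth, here is a minimal sketch of one way to do this with PyTorch's own utilities, assuming `torch.nn.SyncBatchNorm` and `torch.optim.swa_utils` are available; the function names (`build_ddp_model`, `finalize_swa`) and variables (`local_rank`, `train_loader`, `swa_model`) are placeholders, not something prescribed by this thread.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim.swa_utils import update_bn

def build_ddp_model(model, local_rank):
    # Replace every BatchNorm layer with SyncBatchNorm so that running
    # statistics are computed across all GPUs instead of per rank.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])

def finalize_swa(swa_model, train_loader, local_rank):
    # The extra pass mentioned above: recompute the BN running statistics
    # for the averaged weights. With SyncBatchNorm the statistics are
    # averaged across GPUs during this forward pass, so each rank only
    # needs to iterate over its own shard of the training data
    # (train_loader is assumed to use a DistributedSampler).
    device = torch.device(f"cuda:{local_rank}")
    update_bn(train_loader, swa_model, device=device)
```

This is just one possible setup; whether SyncBatchNorm is worth the communication overhead for the bn_update pass (versus letting each rank keep its own statistics, or averaging them once afterwards) is exactly the question being discussed here.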