SWA with distributed training
In the case of distributed training, e.g. DDP, each GPU only processes a minibatch, so the BN statistics computed on each GPU are different. When SWA is adopted, we need to run one more epoch for bn_update. In this epoch, should we use sync BN to average the BN statistics from all GPUs? And are there any other modifications we need to make for DDP training?
Hi @milliema, I'd say you should do the same thing that is normally done with the batchnorm statistics at the end of parallel training; I imagine you are syncing the statistics between the copies of the model? I personally did not look into distributed SWA much, but here is a potentially useful reference: https://openreview.net/forum?id=rygFWAEFwS.
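For what it's worth, here is a minimal sketch of one way to do this with PyTorch's own utilities, assuming `torch.nn.SyncBatchNorm` and `torch.optim.swa_utils` are available; the function names (`build_ddp_model`, `finalize_swa`) and variables (`local_rank`, `train_loader`, `swa_model`) are placeholders, not something prescribed by this thread.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim.swa_utils import update_bn

def build_ddp_model(model, local_rank):
    # Replace every BatchNorm layer with SyncBatchNorm so that running
    # statistics are computed across all GPUs instead of per rank.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])

def finalize_swa(swa_model, train_loader, local_rank):
    # The extra pass mentioned above: recompute the BN running statistics
    # for the averaged weights. With SyncBatchNorm the statistics are
    # averaged across GPUs during this forward pass, so each rank only
    # needs to iterate over its own shard of the training data
    # (train_loader is assumed to use a DistributedSampler).
    device = torch.device(f"cuda:{local_rank}")
    update_bn(train_loader, swa_model, device=device)
```

This is just one possible setup; whether SyncBatchNorm is worth the communication overhead for the bn_update pass (versus letting each rank keep its own statistics, or averaging them once afterwards) is exactly the question being discussed here.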