distributed-pytorch
Scaling the learning rate in DDP
Hi, I understand that we need to scale the learning rate in DDP to account for the gradients being averaged across processes at the end of each step. But I'm confused about the choice of 256. in the ddp_apex Python script and, e.g., 512. in this DeiT github repo.
I don't think this can be an arbitrary value; it seems bound such that LR = LR * X with X > 1. If that's correct, why not just do lr_scaled = lr * world_size?
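For concreteness, here is a minimal sketch of the two rules I'm comparing. The numbers and variable names are placeholders, and I'm assuming the 256./512. constants play the role of a reference batch size, which is what I'd like confirmed:

```python
base_lr = 0.1          # LR tuned for some reference (single-process) batch size
ref_batch_size = 256   # the constant I'm asking about (512 in the DeiT repo)
batch_size_per_gpu = 64
world_size = 4         # number of DDP processes / GPUs

# Rule I see in the scripts: scale by the global batch size relative to a reference.
global_batch_size = batch_size_per_gpu * world_size
lr_ref_scaled = base_lr * global_batch_size / ref_batch_size

# What I'm proposing instead: scale only by the number of processes.
lr_world_scaled = base_lr * world_size

print(lr_ref_scaled, lr_world_scaled)  # 0.1 vs 0.4 with the numbers above
```

As the example shows, the two rules only agree when batch_size_per_gpu * world_size happens to equal the reference constant, so I'd like to understand which one is intended and why.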