distributed-pytorch
Scaling the learning rate in DDP
Hi, I understand that we need to scale the learning rate in DDP to account for the gradients being averaged across processes at the end of each step. But I'm confused about the choice of 256. in the ddp_apex Python script and, e.g., 512. in this DeiT github repo.
I don't think this can be an arbitrary value; it seems bound such that LR = LR * X with X > 1. If that's correct, why not just do lr_scaled = lr * world_size?
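For concreteness, here is a minimal sketch of the two rules I'm comparing. The numbers and variable names are placeholders, and I'm assuming the 256./512. constants play the role of a reference batch size, which is what I'd like confirmed:

```python
base_lr = 0.1          # LR tuned for some reference (single-process) batch size
ref_batch_size = 256   # the constant I'm asking about (512 in the DeiT repo)
batch_size_per_gpu = 64
world_size = 4         # number of DDP processes / GPUs

# Rule I see in the scripts: scale by the global batch size relative to a reference.
global_batch_size = batch_size_per_gpu * world_size
lr_ref_scaled = base_lr * global_batch_size / ref_batch_size

# What I'm proposing instead: scale only by the number of processes.
lr_world_scaled = base_lr * world_size

print(lr_ref_scaled, lr_world_scaled)  # 0.1 vs 0.4 with the numbers above
```

As the example shows, the two rules only agree when batch_size_per_gpu * world_size happens to equal the reference constant, so I'd like to understand which one is intended and why.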