Learning rate scaling
I see the default learning rate of SoundStreamTrainer is 2e-4. I have a few questions:
- Should the LR be doubled if the batch size is doubled?
- Should the LR be doubled if the number of GPUs is doubled, such as when training multi-GPU with accelerate? Or is this effectively scaled inside train_step()?
- Should the LR be doubled if the gradient accumulation steps are doubled? I notice this implementation does a custom thing rather than using accelerate's accumulation steps.
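For concreteness, the three knobs in these questions (per-device batch size, device count, accumulation steps) all multiply into one effective batch size per optimizer update — a tiny sketch with made-up numbers:

```python
def effective_batch_size(per_device_batch: int, num_devices: int, grad_accum_steps: int) -> int:
    """Samples contributing to a single optimizer step.

    Hypothetical helper for illustration; the numbers below are made up
    and not SoundStreamTrainer defaults.
    """
    return per_device_batch * num_devices * grad_accum_steps

print(effective_batch_size(8, 1, 1))  # baseline: 8
print(effective_batch_size(8, 2, 2))  # 2 GPUs + 2 accum steps: 32
```

Whatever LR scaling rule one picks, it should be applied to this effective batch size, not to any single knob in isolation.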
@hmartiro oh hey Hayk! yeah, you know, even after all this time, I still don't know the answer to this. maybe an optimizer expert can stand up and say something more declarative, put this to rest
i think the conventional rule of thumb has always been that LR should increase as batch size increases (which scales linearly with the number of devices). however, i don't know what the exact relationship should be, and clearly some papers ignore this (for example, the recent Llama paper still used a learning rate of 3e-4 even with a batch size of 4 million...)
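the two heuristics that come up most often are linear scaling (Goyal et al., "Accurate, Large Minibatch SGD") and square-root scaling, which is gentler and sometimes preferred for Adam-family optimizers. a sketch, where the base batch size of 8 is made up for illustration:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Heuristic LR scaling with batch size.

    'linear' follows the linear scaling rule; 'sqrt' is a gentler
    alternative. Neither is guaranteed optimal — treat as a starting
    point for a sweep, not a law.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# e.g. default 2e-4 at a (hypothetical) base batch of 8, doubled to 16:
print(scaled_lr(2e-4, 8, 16))          # linear -> 4e-4
print(scaled_lr(2e-4, 8, 16, "sqrt"))  # sqrt   -> ~2.83e-4
```

in practice people usually also add LR warmup when scaling up aggressively, which is part of why the large-batch Llama-style runs get away with a fixed 3e-4.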
for gradient accumulation, huggingface was building that feature just as I started using accelerate, and when i last tried it, it had a few rough edges. i'll give it another try on a new GAN project, and if it works well, i'll redo the code here. just being cautious
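one thing that is easy to verify regardless of which accumulation implementation is used: as long as the per-micro-batch gradients are averaged (divided by the number of accumulation steps), the update matches the full-batch gradient exactly, so it's the effective batch size that matters for the LR, not the accumulation count on its own. a toy pure-Python check on a made-up one-parameter linear model:

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) w.r.t. w over the batch (toy model)."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # made-up data

# Full-batch gradient in one step.
full = grad_mse(w, xs, ys)

# Same batch split into 2 accumulation micro-batches, averaged.
accum = (grad_mse(w, xs[:2], ys[:2]) + grad_mse(w, xs[2:], ys[2:])) / 2

print(full, accum)  # identical up to float rounding
```

so doubling accumulation steps while halving the per-step batch leaves the effective batch unchanged, and the LR shouldn't need to move; doubling accumulation steps at a fixed per-step batch doubles the effective batch, and the scaling heuristics above apply.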