Learning rate scaling
I see the default learning rate of SoundStreamTrainer is 2e-4. I have a few questions:
- Should the LR be doubled if the batch size is doubled?
- Should the LR be doubled if the number of GPUs is doubled, such as when training multi-GPU with accelerate? Or is this effectively scaled inside train_step()?
- Should the LR be doubled if the gradient accumulation steps are doubled? I notice this implementation does a custom thing rather than using accelerate's accumulation steps.
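For concreteness, the three knobs in these questions (per-device batch size, device count, accumulation steps) all multiply into one effective batch size per optimizer update — a tiny sketch with made-up numbers:

```python
def effective_batch_size(per_device_batch: int, num_devices: int, grad_accum_steps: int) -> int:
    """Samples contributing to a single optimizer step.

    Hypothetical helper for illustration; the numbers below are made up
    and not SoundStreamTrainer defaults.
    """
    return per_device_batch * num_devices * grad_accum_steps

print(effective_batch_size(8, 1, 1))  # baseline: 8
print(effective_batch_size(8, 2, 2))  # 2 GPUs + 2 accum steps: 32
```

Whatever LR scaling rule one picks, it should be applied to this effective batch size, not to any single knob in isolation.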
@hmartiro oh hey Hayk! yeah, you know, even after all this time, I still don't know the answer to this. maybe an optimizer expert can stand up and say something more declarative, put this to rest
i think the conventional rule of thumb has always been that LR should increase as batch size increases (which scales linearly with the number of devices). however, i don't know what the exact relationship should be, and clearly some papers ignore this (for example, the recent Llama paper still used a learning rate of 3e-4 even with a batch size of 4 million...)
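the two heuristics that come up most often are linear scaling (Goyal et al., "Accurate, Large Minibatch SGD") and square-root scaling, which is gentler and sometimes preferred for Adam-family optimizers. a sketch, where the base batch size of 8 is made up for illustration:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Heuristic LR scaling with batch size.

    'linear' follows the linear scaling rule; 'sqrt' is a gentler
    alternative. Neither is guaranteed optimal — treat as a starting
    point for a sweep, not a law.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# e.g. default 2e-4 at a (hypothetical) base batch of 8, doubled to 16:
print(scaled_lr(2e-4, 8, 16))          # linear -> 4e-4
print(scaled_lr(2e-4, 8, 16, "sqrt"))  # sqrt   -> ~2.83e-4
```

in practice people usually also add LR warmup when scaling up aggressively, which is part of why the large-batch Llama-style runs get away with a fixed 3e-4.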
for gradient accumulation, huggingface was building that feature just as I started using accelerate, and when i last tried it, it had a few rough edges. i'll give it another try on a new GAN project, and if it works well, i'll redo the code here. just being cautious
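one thing that is easy to verify regardless of which accumulation implementation is used: as long as the per-micro-batch gradients are averaged (divided by the number of accumulation steps), the update matches the full-batch gradient exactly, so it's the effective batch size that matters for the LR, not the accumulation count on its own. a toy pure-Python check on a made-up one-parameter linear model:

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) w.r.t. w over the batch (toy model)."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # made-up data

# Full-batch gradient in one step.
full = grad_mse(w, xs, ys)

# Same batch split into 2 accumulation micro-batches, averaged.
accum = (grad_mse(w, xs[:2], ys[:2]) + grad_mse(w, xs[2:], ys[2:])) / 2

print(full, accum)  # identical up to float rounding
```

so doubling accumulation steps while halving the per-step batch leaves the effective batch unchanged, and the LR shouldn't need to move; doubling accumulation steps at a fixed per-step batch doubles the effective batch, and the scaling heuristics above apply.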