Liyuan Liu
Thanks for bringing this up. In our analysis & experiments, we haven't tried any learning rate restarts. I agree this issue may be due to numerical instability or the algorithm design. Will...
Also, RAdam didn't obviate all need for warmup :( we found that in some cases adding additional warmup gets better performance (some discussions are at: https://github.com/LiyuanLucasLiu/RAdam#questions-and-discussions).
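If it helps, here is a minimal sketch of stacking a linear warmup on top of RAdam with PyTorch's `LambdaLR`; the import path, the toy model, and `warmup_steps=500` are illustrative assumptions, not settings from the repo or the discussions linked above.

```python
import torch
from radam import RAdam  # assumes the repo's radam.py is importable

model = torch.nn.Linear(10, 2)
optimizer = RAdam(model.parameters(), lr=1e-3)

# Linear warmup over the first `warmup_steps` updates, applied on top of
# RAdam's own rectification; warmup_steps is an illustrative value.
warmup_steps = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

# In the training loop, call scheduler.step() after each optimizer.step().
```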
@e-sha the `step_size` here is not the learning rate, but more like a step-size ratio. When `N_sma` < 5, the adaptive learning rate is turned off, and `step_size`...
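For illustration, a small sketch of the step-size-ratio logic described above; the function name and defaults are mine, and the formula is a paraphrase of the rectification term from the paper rather than a copy of the repo's `radam.py`.

```python
import math

def radam_step_size(step, beta1=0.9, beta2=0.999):
    """Return (step-size ratio, adaptive?) for update `step` (1-indexed)."""
    beta2_t = beta2 ** step
    n_sma_max = 2.0 / (1.0 - beta2) - 1.0
    n_sma = n_sma_max - 2.0 * step * beta2_t / (1.0 - beta2_t)

    if n_sma >= 5:
        # Variance is tractable: apply the rectification term and keep the
        # adaptive (second-moment) denominator in the actual update.
        rect = math.sqrt(
            (n_sma - 4) / (n_sma_max - 4)
            * (n_sma - 2) / n_sma
            * n_sma_max / (n_sma_max - 2)
        )
        return rect * math.sqrt(1.0 - beta2_t) / (1.0 - beta1 ** step), True
    # Variance is intractable (early steps): fall back to an SGDM-style
    # update with only the first-moment bias correction.
    return 1.0 / (1.0 - beta1 ** step), False
```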
Thanks for letting us know, @e-sha. Can you provide a full script to reproduce the result? I'm not sure why `RAdam` behaves in this way. Intuitively, SGDM should be more...
@e-sha Thanks for letting us know :-) I guess you mean the problem is in the parameter values, or the gradient values? I think in the first iteration SGDM, although with...
I see, I understand it now, thanks for sharing. BTW, people find that using gradient clipping also helps stabilize model training.
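For example, a minimal sketch of clipping the global gradient norm before the update step with PyTorch's `clip_grad_norm_`; the toy model, the data, and `max_norm=1.0` are placeholders, not recommendations from this thread.

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x, y = torch.randn(8, 10), torch.randn(8, 2)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm, then apply the optimizer update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```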
@e-sha I added an option to decide whether to use SGDM: https://github.com/LiyuanLucasLiu/RAdam/commit/373b3e405c7f8d24fe068aee0472e5c3ae231cdc
Thanks for reaching out. I haven't observed this, and I'm wondering whether you can provide a simple setup to reproduce the phenomenon. BTW, there is a known issue that can...
Hi @Tony-Y, I'm curious why you prefer to use Adam with a warmup instead of RAdam. I think the very basic fact both papers agree on is that it's necessary...
Hi, it would help to provide a script / setting to reproduce the error.