
On the Variance of the Adaptive Learning Rate and Beyond

13 RAdam issues

Since it fixes the variance issue, wouldn't that mean it still needs an annealing scheduler (but not a warm-up scheduler)?
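For reference, here is a minimal sketch of pairing RAdam with a decay-only schedule and no warm-up phase; the toy model, the cosine schedule, and the `from radam import RAdam` import path are illustrative assumptions, not a prescribed setup.

```python
import torch
from radam import RAdam  # assumes the optimizer from this repo is on the path

model = torch.nn.Linear(10, 2)  # toy model, for illustration only
optimizer = RAdam(model.parameters(), lr=1e-3)
# Annealing only: decay the learning rate over training, with no warm-up stage.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```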

For example, figure 1: ![radam fig1](https://user-images.githubusercontent.com/1855278/128782779-50cc2bdf-8e40-4042-901e-47c5d45e446d.png) More generally, I am trying to figure out whether people train transformers with respect to epochs or iterations (1 iteration is one batch).

I'm getting this warning as I train:

```
UserWarning: This overload of addcmul_ is deprecated:
    addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
    addcmul_(Tensor...
```
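The warning refers to the older positional-scalar overload; recent PyTorch versions want the scalar passed as the `value` keyword. A minimal sketch of the change, using the usual Adam-style second-moment names as placeholders:

```python
import torch

grad = torch.randn(3)
exp_avg_sq = torch.zeros(3)
beta2 = 0.999

# Deprecated overload (triggers the UserWarning): scalar passed positionally.
# exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

# Current signature: pass the scalar as the `value` keyword argument.
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
```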

I am curious: why hasn't RAdam been included officially in PyTorch? https://github.com/pytorch/pytorch/issues/24892

I observed that RAdam can produce NaN loss in the first epochs while Adam does not. This is not only for one or two experiments but a general...

Hi, I have a small optimization to suggest: is there any particular reason not to simplify [line 84] `p_data_fp32.add_(-group['weight_decay'] * group['lr'], p_data_fp32)` into `p_data_fp32.mul_(-group['weight_decay'] * group['lr'])`? Other...
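For context, a small sketch of the arithmetic behind the quoted line, assuming the older positional-alpha overload of `add_` (self plus `alpha` times the other tensor); it shows which single-multiply form would reproduce the same update. The values here are illustrative only.

```python
import torch

lr, weight_decay = 1e-3, 1e-2
p_data_fp32 = torch.ones(3)

# Quoted form: p <- p + (-wd * lr) * p, i.e. p <- p * (1 - wd * lr)
a = p_data_fp32.clone()
a.add_(p_data_fp32, alpha=-weight_decay * lr)  # keyword spelling of the same overload

# A single in-place multiply reproduces it only if the identity term is kept.
b = p_data_fp32.clone()
b.mul_(1 - weight_decay * lr)

assert torch.allclose(a, b)
```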

Hi, it is said that naive Adam performs badly when weight decay is added; thus people invented AdamW to make Adam compatible with weight decay. Now I have...
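For background, a minimal sketch of the usual distinction between the two ways of applying weight decay with Adam-style optimizers (the standard L2-vs-decoupled contrast, not code from this repo); the decoupled form matches the parameter-shrinking line quoted in the issue above.

```python
import torch

lr, weight_decay = 1e-3, 1e-2
p = torch.ones(3)
grad = torch.randn(3)

# Naive Adam + L2 penalty: fold the decay into the gradient, so it then
# passes through the adaptive second-moment rescaling.
grad_with_l2 = grad + weight_decay * p

# AdamW-style decoupled decay: shrink the weights directly, independent of
# the adaptive gradient statistics.
p_decoupled = p * (1 - lr * weight_decay)
```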

For the language model (LM) experiments on One Billion Words, the final test PPL with Adam and RAdam is around 41 and 40, respectively, worse than the numbers reported...

How can I use this in TF 1.4?