
On the Variance of the Adaptive Learning Rate and Beyond

13 RAdam issues

Since it fixes the variance issue, wouldn't that mean it still needs an annealing scheduler (but not a warm-up scheduler)?
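For reference, here is a minimal sketch of pairing RAdam with a decay-only schedule and no warm-up phase; the toy model, the cosine schedule, and the `from radam import RAdam` import path are illustrative assumptions, not a prescribed setup.

```python
import torch
from radam import RAdam  # assumes the optimizer from this repo is on the path

model = torch.nn.Linear(10, 2)  # toy model, for illustration only
optimizer = RAdam(model.parameters(), lr=1e-3)
# Annealing only: decay the learning rate over training, with no warm-up stage.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```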

For example, figure 1: ![radam fig1](https://user-images.githubusercontent.com/1855278/128782779-50cc2bdf-8e40-4042-901e-47c5d45e446d.png) More generally, I am trying to figure out whether people train transformers with respect to epochs or iterations (1 iteration is one batch).

I'm getting this warning as I train:

```
UserWarning: This overload of addcmul_ is deprecated:
    addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
    addcmul_(Tensor...
```
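The warning refers to the older positional-scalar overload; recent PyTorch versions want the scalar passed as the `value` keyword. A minimal sketch of the change, using the usual Adam-style second-moment names as placeholders:

```python
import torch

grad = torch.randn(3)
exp_avg_sq = torch.zeros(3)
beta2 = 0.999

# Deprecated overload (triggers the UserWarning): scalar passed positionally.
# exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

# Current signature: pass the scalar as the `value` keyword argument.
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
```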

I am curious: why hasn't RAdam been included officially in PyTorch? https://github.com/pytorch/pytorch/issues/24892

I observed that RAdam can produce NaN loss in the first epochs while Adam does not. This is not only for one or two experiments but a general...

Hi, I have a small optimization to suggest: is there any particular reason not to simplify [line 84] `p_data_fp32.add_(-group['weight_decay'] * group['lr'], p_data_fp32)` into `p_data_fp32.mul_(-group['weight_decay'] * group['lr'])`? Other...
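For context, a small sketch of the arithmetic behind the quoted line, assuming the older positional-alpha overload of `add_` (self plus `alpha` times the other tensor); it shows which single-multiply form would reproduce the same update. The values here are illustrative only.

```python
import torch

lr, weight_decay = 1e-3, 1e-2
p_data_fp32 = torch.ones(3)

# Quoted form: p <- p + (-wd * lr) * p, i.e. p <- p * (1 - wd * lr)
a = p_data_fp32.clone()
a.add_(p_data_fp32, alpha=-weight_decay * lr)  # keyword spelling of the same overload

# A single in-place multiply reproduces it only if the identity term is kept.
b = p_data_fp32.clone()
b.mul_(1 - weight_decay * lr)

assert torch.allclose(a, b)
```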

Hi, it is said that naive Adam performs badly when weight decay is added; thus people invented AdamW to make Adam compatible with weight decay. Now I have...
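For background, a minimal sketch of the usual distinction between the two ways of applying weight decay with Adam-style optimizers (the standard L2-vs-decoupled contrast, not code from this repo); the decoupled form matches the parameter-shrinking line quoted in the issue above.

```python
import torch

lr, weight_decay = 1e-3, 1e-2
p = torch.ones(3)
grad = torch.randn(3)

# Naive Adam + L2 penalty: fold the decay into the gradient, so it then
# passes through the adaptive second-moment rescaling.
grad_with_l2 = grad + weight_decay * p

# AdamW-style decoupled decay: shrink the weights directly, independent of
# the adaptive gradient statistics.
p_decoupled = p * (1 - lr * weight_decay)
```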

For the language model (LM) experiments on One Billion Words, the final test PPL with Adam and RAdam is around 41 and 40, respectively, worse than the numbers reported...

How can I use this in TF 1.4?