rezero weight decay for the resweight?

Open Kyeongpil opened this issue 3 years ago • 2 comments

Hello, I read the paper and found it interesting. I have a question.

Many implementations, including Huggingface's, exclude LayerNorm parameters and biases from weight decay to help convergence (https://github.com/huggingface/transformers/issues/492). Would it also help to exclude the resweight parameters from weight decay?

Nov 24 '20 02:11 Kyeongpil

Yes, it would seem reasonable to not decay resweights since other parameters are already being decayed.

Nov 28 '20 05:11 calclavia
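For readers who want to try the suggestion above, here is a minimal sketch of splitting parameters into a decayed and a non-decayed group. It assumes the residual weights are registered under a parameter name containing `resweight`, and that LayerNorm weights and biases can be matched by the name fragments `LayerNorm.weight` and `bias`. The helper `build_param_groups` and these fragments are hypothetical and should be adjusted to the names your model actually uses. The thread does not report whether excluding resweights trains better, so treat this as something to test rather than a confirmed improvement.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical name fragments; adjust them to the parameter names in your model.
NO_DECAY_KEYWORDS = ("bias", "LayerNorm.weight", "resweight")


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    weight_decay: float = 0.01,
    no_decay_keywords: Tuple[str, ...] = NO_DECAY_KEYWORDS,
) -> List[Dict[str, Any]]:
    """Split (name, parameter) pairs into a decayed and a non-decayed group."""
    decay, no_decay = [], []
    for name, param in named_params:
        # A parameter skips weight decay if its name contains any listed fragment.
        if any(key in name for key in no_decay_keywords):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Usage with PyTorch (assumed setup, not taken from the thread):
#   groups = build_param_groups(model.named_parameters())
#   optimizer = torch.optim.AdamW(groups, lr=1e-4)
```

The returned list follows the per-parameter-group format that PyTorch optimizers accept, where each group can override `weight_decay`.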

@calclavia I have the same question, but did this prove to be better? Or is it just to speed up calculations?

Feb 17 '21 16:02 fightnyy
