rezero
weight decay for the resweight?
Hello, I read the paper and found it interesting. I have a question.
Many implementations, including Hugging Face's, exclude LayerNorm and bias parameters from weight decay to help convergence. (https://github.com/huggingface/transformers/issues/492) Would it be helpful to also exclude the resweight parameters from weight decay?
Yes, it would seem reasonable to not decay resweights since other parameters are already being decayed.
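For reference, here is a minimal sketch of how such an exclusion could be set up, following the common name-based parameter grouping pattern. The helper name `split_decay_groups`, the substring keys, and the example parameter names are assumptions for illustration; check the actual names your model reports before relying on them.

```python
def split_decay_groups(named_params, weight_decay=0.01,
                       no_decay_keys=("bias", "LayerNorm.weight", "resweight")):
    """Split (name, param) pairs into a decayed and a non-decayed group.

    Hypothetical helper: a parameter is excluded from weight decay when
    its name contains any of the substrings in `no_decay_keys`.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if any(key in name for key in no_decay_keys):
            no_decay.append(param)
        else:
            decay.append(param)
    # List of dicts in the per-group options layout that PyTorch optimizers accept.
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Stand-in for model.named_parameters(); names and values are made up.
example = [
    ("layers.0.linear1.weight", "w0"),
    ("layers.0.linear1.bias", "b0"),
    ("layers.0.resweight", "r0"),
]
groups = split_decay_groups(example)
```

With a real PyTorch model, the result of `split_decay_groups(model.named_parameters())` could be passed directly as the first argument to an optimizer such as `torch.optim.AdamW`. Whether excluding resweight actually improves results would still need to be checked empirically.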
@calclavia I have the same question, but did this prove to be better? Or is it just to speed up calculations?