retnet training config
Hello,
I have followed the training configuration introduced here (https://github.com/microsoft/torchscale/issues/52) with the retnet_medium architecture. I have a few questions that I would appreciate anyone answering.
The first is about initialization. From the RetNet paper (https://arxiv.org/abs/2307.08621), I saw that parameters are initialized following DeepNet, so I am wondering why deepnorm is set to False in RetNetConfig, and where I should set it to True (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L239).
If I simply add "--deepnorm" on the command line, it gets activated together with subln (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L240), and I then find that the output of each layer grows larger and larger as the layer id increases.
The second is about the vocabulary. I am new to fairseq, so I am not sure how to handle a large dataset via fairseq-preprocess. I am trying to use MiniPile, but the resulting dict.txt has 32,309,612 lines. That seems too large, so I am wondering whether there is an official recommendation for this part.
The third is about --share-decoder-input-output-embed. Is it recommended? I am sorry if I missed it in the paper.
Thank you guys in advance:)
Hi, is there any resolution to this question regarding the initialization and the recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs. Thanks so much!
- --share-decoder-input-output-embed saves model parameters, especially when the model size is small, and the performance is almost the same. We enable it in our experiments.
- Don't activate --subln or --deepnorm; the current initialization is good enough.
- The training instability comes from the bias in nn.Linear and the eps in LayerNorm. In our experiments, we set bias=False and eps=1e-5. Besides, RMSNorm is helpful for stability, so we made that modification as well.
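For reference, here is a minimal RMSNorm sketch in PyTorch (my own illustration of the idea, not the exact torchscale implementation; the eps value simply mirrors the 1e-5 recommendation above):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: scales by the root-mean-square of the features.

    Unlike LayerNorm it subtracts no mean and has no bias term, which is
    part of why it tends to behave more stably in this setting.
    """
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); normalize over the last dimension
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```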
@simran-arora @hanlinxuy
- The LN eps was modified from 1e-6 to 1e-5 in the commit https://github.com/microsoft/torchscale/commit/d1fefe9c22bad07535f56c4c461b94588dd8cc84
- RMSNorm is also used as of the commit https://github.com/microsoft/torchscale/commit/5c89ffbeea3ba458a865a569f947bf82cca50090, so that the effect of the LN eps is eliminated
- For the RetNet implementation, the initialization principle proposed in DeepNet has been integrated, so the arguments --subln or --deepnorm should not be added.
- Removing bias also improves training stability.

The latest released code takes the above points into account.
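In other words, it should be enough to build the decoder with the default flags, without passing --deepnorm or --subln. A minimal sketch along the lines of the torchscale README (treat the exact constructor arguments, such as the vocab_size kwarg, as my assumption rather than verified API):

```python
from torchscale.architecture.config import RetNetConfig
from torchscale.architecture.retnet import RetNetDecoder

# Leave deepnorm and subln at their defaults; per the reply above, the
# DeepNet-derived initialization is already built into the implementation.
config = RetNetConfig(vocab_size=64000)
retnet = RetNetDecoder(config)

# Inspect the two flags discussed in this thread
print(config.deepnorm, config.subln)
```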
Thanks so much! I had used LayerNorm and did not set bias=False. I will try switching these.
Adding the explicit deepnorm initialization also improved stability for my downstream runs, but I will try using the recommended techniques instead.
@simran-arora It's better to set bias=False both in layer norm and nn.Linear.
Besides, would you mind sharing the training details with us, e.g. corpus, model size, and hyper-parameters? We'd like to take a look at the setting where you see instability.
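To make that advice concrete, here is a small sketch (plain PyTorch, not torchscale code; note that the bias= keyword on nn.LayerNorm only exists in newer PyTorch releases, so treat that as an assumption about your version):

```python
import torch.nn as nn

embed_dim, ffn_dim = 1024, 2048  # example sizes only

# Linear layers without bias, as recommended above
fc1 = nn.Linear(embed_dim, ffn_dim, bias=False)
fc2 = nn.Linear(ffn_dim, embed_dim, bias=False)

# LayerNorm with eps=1e-5 and no bias; the bias=False keyword needs a recent
# PyTorch, otherwise an RMSNorm-style module (which has no bias at all) works too
norm = nn.LayerNorm(embed_dim, eps=1e-5, bias=False)
```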
Thank you very much! I will try again later with this new information!