Can you clarify the training process?
There's no training code included in the repo, so it's hard to tell exactly how the training was done. The paper states:
"We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020)." Is this for the transformer models, or also for Mamba? The details of Brown training recipe have a number of different steps and parameters, which parts were used? The LR warmup over 375M tokens? The batch size linear increase from 32K tokens to 0.5M/1M/2M/3.5M tokens over the first 4-12B tokens of training?
The training details in E.2 say "By default, the peak learning rate is the GPT3 specification" or "a peak value of 5× the GPT3 value" in the improved recipe. Depending on the GPT-3 model size, the learning rate ranged from 6e-4 to 6e-5 -- Were different learning rates used to train the differently sized Mamba models?
The readme mentions that the models were trained with AMP. Can I assume that the whole forward function was wrapped in torch.autocast?
Thanks for any additional details you can provide. I'm trying to replicate the training for the smaller-sized models in preparation for training a larger model.
"We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020)." Is this for the transformer models, or also for Mamba? The details of Brown training recipe have a number of different steps and parameters, which parts were used? The LR warmup over 375M tokens? The batch size linear increase from 32K tokens to 0.5M/1M/2M/3.5M tokens over the first 4-12B tokens of training?
For both transformers and Mamba. LR warmup is over 10% of the training steps (you can warm up for less as well; it doesn't really change the end result). We don't do batch-size ramp-up; we keep the batch size constant.
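A minimal sketch of those two points, with placeholder numbers of my own (the token budget, sequence length, and batch size below are illustrative, not values stated in the thread): the warmup length is 10% of the total optimizer steps, and the batch size stays fixed for the whole run. The full LR schedule is sketched after the last reply below.

```python
# Placeholder run configuration (not the authors' numbers).
total_tokens = 300_000_000_000        # assumed token budget
seq_len = 2048                        # assumed sequence length
batch_size = 256                      # constant for the whole run (no ramp-up)

tokens_per_step = batch_size * seq_len
total_steps = total_tokens // tokens_per_step
warmup_steps = int(0.10 * total_steps)  # warm up over 10% of the steps
```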
> The training details in E.2 say "By default, the peak learning rate is the GPT3 specification", or "a peak value of 5× the GPT3 value" in the improved recipe. Depending on the GPT-3 model size, the learning rate ranged from 6e-4 to 6e-5. Were different learning rates used to train the differently sized Mamba models?
Yes: for the 125M Mamba model we use lr = 3e-3, for the 350M model we use lr = 1.5e-3, etc.
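As a hedged sketch of the "5× the GPT-3 peak LR" rule: the 125M and 350M values below are the ones stated in this thread, and the 790M-class value matches the follow-up later in the thread; the remaining entries are simply the 5× rule applied to the GPT-3 paper's peak learning rates, i.e. my extrapolation rather than confirmed settings.

```python
# Peak learning rates from the GPT-3 paper (Brown et al. 2020), by model size.
GPT3_PEAK_LR = {
    "125M": 6.0e-4,
    "350M": 3.0e-4,
    "760M": 2.5e-4,   # GPT-3 Large; closest to the ~790M Mamba model
    "1.3B": 2.0e-4,
    "2.7B": 1.6e-4,
}

# "Improved recipe": peak LR = 5x the GPT-3 value for the matching size.
MAMBA_PEAK_LR = {size: 5 * lr for size, lr in GPT3_PEAK_LR.items()}
# -> {"125M": 3e-3, "350M": 1.5e-3, "760M": 1.25e-3, "1.3B": 1e-3, "2.7B": 8e-4}
```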
> The readme mentions that the models were trained with AMP. Can I assume that the whole forward function was wrapped in torch.autocast?
Yes
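A minimal sketch of what that looks like, wrapping the forward pass (and loss computation) in `torch.autocast`. The model, batch, and choice of `bfloat16` below are my placeholders/assumptions, not values confirmed in the thread; with `float16` you would typically also add a `torch.cuda.amp.GradScaler`.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)

batch = torch.randn(8, 1024, device="cuda")  # placeholder batch

# Forward pass under autocast; backward/step run outside the autocast region.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(batch)
    loss = out.float().mean()                # placeholder loss

loss.backward()
optimizer.step()
optimizer.zero_grad()
```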
> Thanks for any additional details you can provide. I'm trying to replicate the training for the smaller-sized models in preparation for training a larger model.
In general, different hparams don't really change the end results; the lr is probably the most important one.
Thanks. Just to be clear: for the 790M model you do a linear warmup up to 5 × 2.5e-4 (the 2.5e-4 being the GPT-3 Large 760M value) over the first 10% of tokens, and then cosine decay down to 1e-5?
Yes that's right.
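Putting the confirmed pieces together, here is a sketch of that schedule for the 790M configuration: linear warmup to a peak of 5 × 2.5e-4 = 1.25e-3 over the first 10% of steps, then cosine decay down to 1e-5. `total_steps` is a placeholder that depends on your token budget and batch size, and the optimizer choice is an assumption.

```python
import math
import torch

peak_lr = 5 * 2.5e-4        # 5x the GPT-3 Large (760M) peak LR = 1.25e-3
min_lr = 1e-5               # cosine-decay floor
total_steps = 100_000       # placeholder; derive from tokens / (batch * seq_len)
warmup_steps = int(0.10 * total_steps)

model = torch.nn.Linear(10, 10)    # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step):
    # Linear warmup from 0 to peak_lr over the first 10% of steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```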