bitsandbytes
More information on training from scratch and finetuning
Thanks for the great work!
I am looking for some additional information on using the library to train a model from scratch or to fine-tune one. The only information I could find was in the appendices of the corresponding paper. Specifically:
- How many GPUs (and which) were used to train the RoBERTa model from scratch in Appendix D?
- How many GPUs (and which) were used to fine-tune the RoBERTa-large model in the fine-tuning section of Appendix E?
- Is there any calculation of how many GPUs you would need to train models with different numbers of parameters with the library?
- If you have any additional/updated insights for training from scratch or fine-tuning, that would be wonderful!
Thanks for your help!
Thanks for the questions!
The models that we trained were autoregressive language models; only the corpus that we used was from RoBERTa. The baseline models come from the BASE layer paper.
For fine-tuning, we used the fairseq RoBERTa-large model.
- The largest models were trained on 128 V100 GPUs. The implementation was inefficient (not the one in this repo) but produced exactly the same results.
- 1 GPU was used for fine-tuning.
- This varies widely, since a large chunk of the memory is due to the activations/attention, which depend on the sequence dimension and only partially on the model size.
- One insight not mentioned in the paper: the sheer depth of a model is a problem, so training a model with the same number of parameters is easier if it is wider and shallower.
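To make the point about sequence-dependent memory concrete, here is a rough back-of-the-envelope sketch (not a measurement from bitsandbytes; the formulas are simplified assumptions that ignore optimizer state, heads, and layer norms):

```python
# Rough per-layer memory estimate for a transformer, illustrating why GPU
# requirements depend on sequence length, not just parameter count.
# Assumptions: 4x FFN expansion, 2 bytes per element (fp16), no optimizer state.

def layer_memory_bytes(d_model, seq_len, batch_size, bytes_per_el=2):
    # Weights: attention projections (~4 * d^2) + feed-forward (~8 * d^2)
    weights = 12 * d_model * d_model * bytes_per_el
    # Activations: hidden states scale with batch * seq * d_model;
    # attention scores scale with batch * seq^2.
    hidden = batch_size * seq_len * d_model * bytes_per_el
    attn_scores = batch_size * seq_len * seq_len * bytes_per_el
    return weights, hidden + attn_scores

# Doubling the sequence length leaves weight memory unchanged but more
# than doubles activation memory (the seq^2 attention term quadruples).
w_short, a_short = layer_memory_bytes(d_model=1024, seq_len=512, batch_size=8)
w_long, a_long = layer_memory_bytes(d_model=1024, seq_len=1024, batch_size=8)
```

This is why a per-parameter GPU estimate alone would be misleading: the same model can fit or not fit depending on batch size and sequence length.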
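The depth-versus-width trade-off above can be sketched with a common parameter-count approximation (roughly 12 · n_layers · d_model², ignoring embeddings and biases; the specific configs below are made up for illustration):

```python
# Two hypothetical transformer configs with the same approximate parameter
# count: one deep and narrow, one shallow and wide. Per the comment above,
# the shallower/wider variant tends to be easier to train.

def approx_params(n_layers, d_model):
    # ~4*d^2 for attention + ~8*d^2 for the FFN, per layer
    return 12 * n_layers * d_model * d_model

deep_narrow = approx_params(n_layers=48, d_model=1024)   # deep, narrow
shallow_wide = approx_params(n_layers=12, d_model=2048)  # shallow, wide
# Quadrupling width while quartering depth keeps the count identical,
# because d_model enters quadratically and n_layers only linearly.
```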
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.