
More information on training from scratch and finetuning

Open nafizh opened this issue 2 years ago • 1 comment

Thanks for the great work!

I am looking for some additional information on using the library to train a model from scratch or to fine-tune one. The only information I could find was in the appendices of the corresponding paper. Specifically:

  1. How many GPUs (and which) were used to train the RoBERTa model from scratch in Appendix D?
  2. How many GPUs (and which) were used to fine-tune the RoBERTa-large model in the fine-tuning section of Appendix E?
  3. Has any calculation been done on how many GPUs would be needed to train models with different numbers of parameters with the library?
  4. If you have any additional/updated insights on training from scratch or fine-tuning, that would be wonderful!

Thanks for your help!

nafizh, Sep 15 '22

Thanks for the questions!

The models we trained were autoregressive language models; only the training corpus was the one used for RoBERTa. The baseline models come from the BASE Layers paper.

For fine-tuning we used the fairseq RoBERTa-large model.

  1. The largest ones were trained on 128 V100 GPUs. The implementation was inefficient (not the one in this repo), but it produced exactly the same results.
  2. 1 GPU was used for fine-tuning; a minimal sketch of such a single-GPU setup is included after this list.
  3. This varies widely, since a large chunk of the memory is due to the activations/attention, which depend on the sequence length and only partially on the model size; see the rough estimate after this list.
  4. One insight not mentioned in the paper: the sheer depth of a model is a problem, so a model with the same number of parameters is easier to train if it is wider and shallower.
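
Regarding 2, here is a minimal sketch of what a single-GPU fine-tuning loop with the library's 8-bit Adam optimizer can look like. The model and data are toy placeholders (the actual experiments used fairseq RoBERTa-large, which is not reproduced here); only the optimizer usage is the point:

```python
# Minimal single-GPU fine-tuning sketch with the 8-bit Adam optimizer from
# bitsandbytes. Model and data are toy placeholders, NOT the fairseq setup
# from the paper; only the optimizer usage matters here.
import torch
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Sequential(
    bnb.nn.StableEmbedding(50265, 1024),  # stable embedding layer recommended with 8-bit optimizers
    nn.Linear(1024, 2),                   # stand-in classification head
).cuda()

# Drop-in replacement for torch.optim.Adam; the Adam states are stored in 8 bits.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)

for step in range(10):  # toy loop over random data
    tokens = torch.randint(0, 50265, (8, 128), device="cuda")  # (batch, seq_len)
    labels = torch.randint(0, 2, (8,), device="cuda")
    logits = model(tokens).mean(dim=1)                          # pool over the sequence
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

`bnb.optim.Adam8bit` is used exactly like `torch.optim.Adam`, and `bnb.nn.StableEmbedding` replaces `torch.nn.Embedding` for more stable training with 8-bit optimizer states.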
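
Regarding 3, a very rough back-of-envelope estimate can still help. All constants below are ballpark assumptions of my own, not numbers from the paper; the point is that weight, gradient, and optimizer-state memory scale with the parameter count, while activation memory also scales with batch size, sequence length (quadratically for attention), and depth:

```python
# Crude training-memory estimate in GB. The per-activation byte counts are
# rough assumptions, only meant to show how the different terms scale.
def training_memory_gb(n_params, n_layers, d_model, n_heads, batch, seq_len,
                       optim_bytes_per_state=2):  # 2 for 8-bit Adam, 8 for 32-bit Adam
    weights_and_grads = (2 + 2) * n_params                # fp16 weights + fp16 gradients
    optim_states = 2 * optim_bytes_per_state * n_params   # two Adam states per parameter
    act_ffn = 16 * n_layers * batch * seq_len * d_model   # projection/feed-forward activations (assumed ~16 bytes per element)
    act_attn = 2 * n_layers * n_heads * batch * seq_len ** 2  # attention score matrices (fp16)
    return (weights_and_grads + optim_states + act_ffn + act_attn) / 1024 ** 3

# Roughly a RoBERTa-large-sized model, batch size 32, sequence length 512
print(training_memory_gb(355e6, 24, 1024, 16, 32, 512, optim_bytes_per_state=2))  # 8-bit Adam
print(training_memory_gb(355e6, 24, 1024, 16, 32, 512, optim_bytes_per_state=8))  # 32-bit Adam
```

With these assumed numbers the activation terms dominate and depend mostly on batch size and sequence length, which is why the required number of GPUs cannot be read off the parameter count alone.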

TimDettmers, Oct 10 '22

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot], Dec 20 '23