Language modeling examples do not show how to do multi-gpu training / fine-tuning
System Info
- `transformers` version: 4.41.2
- Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
- Python version: 3.9.18
- Huggingface_hub version: 0.23.3
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
@muellerzr @stevhliu
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
n/a
Expected behavior
The run_clm.py and other related scripts in:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
notionally support training / fine-tuning of models whose gradients are too large to fit on a single GPU, if you believe their CLI. However, there is no example showing how to actually do that.
For instance, `accelerate estimate-memory` says training the Mistral-7B family with Adam takes roughly 55 GB with float16, which is more memory than a single 40 GB A100 has. So I'd need to use more than one GPU.
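For reference, that estimate comes from something like:

```bash
# ask accelerate to estimate training memory for the model in float16
accelerate estimate-memory mistralai/Mistral-7B-Instruct-v0.2 \
    --library_name transformers --dtypes float16
```

The ~55 GB figure is roughly consistent with a back-of-envelope 8 bytes/parameter for fp16 weights, gradients, and Adam's two moment buffers (7.2B params × 8 bytes ≈ 58 GB), before any activations or batch data.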
Would it be possible to modify the language_modeling documentation to explain how to do that?
Hi @csiefer2, thanks for opening this issue!
@muellerzr and @stevhliu are best placed to comment on this in general.
In the meantime, you can find some accelerate docs on distributed training here: https://huggingface.co/docs/transformers/en/accelerate#distributed-training-with--accelerate
Just launch the scripts with `accelerate launch` or `torchrun`; no need to do anything else.
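A minimal sketch (the model name and data files here are placeholders):

```bash
# data-parallel launch of the example script across 4 GPUs on one node
accelerate launch --num_processes 4 run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train.txt --do_train --output_dir out

# or, equivalently, with torchrun
torchrun --nproc-per-node 4 run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train.txt --do_train --output_dir out
```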
My attempts to do that have not been successful... run_clm seems happy to fill up the memory of however many GPUs I tell it to use and then die when it finally exceeds the memory limits. That's why I was asking the question :)
For instance, with the v4.41-release branch of transformers from GitHub, if I grab 4 A100s with 80 GB of memory each and do this:

```bash
torchrun --nproc-per-node 4 ./run_clm.py \
    --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 \
    --train_file=myfile1.txt --validation_file=myfile2.txt \
    --do_train --do_eval --output_dir=mydir --report_to none
```

run_clm.py runs itself out of memory... with a model whose gradients accelerate tells me should fit on 1-2 GPUs (depending on whether I use float32 or float16).
I get errors like:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 3 has a total capacity of 79.15 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 79.04 GiB memory in use. Of the allocated memory 77.23 GiB is allocated by PyTorch, and 348.16 MiB is reserved by PyTorch but unallocated.
```
Clearly I'm doing something wrong.
@amyeroberts That example doesn't use the `trainer.train()` function, which is what I'd (ideally) like to use.
@csiefer2 I can't comment on the memory calculation from accelerate (cc @muellerzr here) but I'm assuming this is just for the weights of the model + gradients on the forward/backward pass? You'll also need to account for the memory requirements of loading the data onto the GPU. What batch size are you using?
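One thing worth flagging: launching with plain `torchrun` gives you DistributedDataParallel, which keeps a full copy of the weights, gradients, and optimizer states on every GPU, so adding GPUs doesn't by itself shrink the per-GPU footprint. To actually shard a 7B model across the cards you'd want something like the Trainer's FSDP integration. A rough, untested sketch (flag values taken from the Trainer docs; adjust to taste):

```bash
# shard parameters, gradients and optimizer state across the 4 GPUs with FSDP
torchrun --nproc-per-node 4 ./run_clm.py \
    --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 \
    --train_file=myfile1.txt --validation_file=myfile2.txt \
    --do_train --do_eval --output_dir=mydir --report_to none \
    --bf16 \
    --fsdp "full_shard auto_wrap"
```

DeepSpeed ZeRO (via the Trainer's `--deepspeed` flag and a config file) is the other common route.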
@amyeroberts In the example above, I wasn't specifying it, but I've tried running with a batch size of 1 before and saw the same results. The training/evaluation data I used above is a whole 1.2 MB / 11 kB when stored on disk as text files, so I suspect this isn't a data-size issue.
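For completeness, the batch-size-1 attempt was essentially the earlier command with the standard TrainingArguments flags added (the `--block_size` value here is illustrative, another memory lever rather than something I've tuned):

```bash
torchrun --nproc-per-node 4 ./run_clm.py \
    --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 \
    --train_file=myfile1.txt --validation_file=myfile2.txt \
    --do_train --do_eval --output_dir=mydir --report_to none \
    --per_device_train_batch_size 1 \
    --block_size 512   # cap the tokenized sequence length
```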
Thanks for the feedback!
I think keeping these two topics (language modeling and distributed training) in separate docs is better. It sounds like the issue is more about setting up for distributed training, and not so much language modeling. But we can improve on the distributed training docs with an example use case featuring language modeling.
@stevhliu That sounds perfectly reasonable. If you need someone to test out a revised training document to ensure that it works, I'd be happy to help!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
+1
Thanks @stevhliu!
Thanks for your patience! Working on redesigning the docs right now at https://github.com/huggingface/transformers/pull/31757 and I'll update the distributed training docs when I reach it 🙂
Bump!