Language modeling examples do not show how to do multi-gpu training / fine-tuning
System Info
- `transformers` version: 4.41.2
- Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
- Python version: 3.9.18
- Huggingface_hub version: 0.23.3
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
@muellerzr @stevhliu
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
n/a
Expected behavior
The run_clm.py and other related scripts in:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
notionally support training / fine-tuning of models whose gradients are too large to fit on a single GPU, if you believe their CLI. However, there is no example showing how to actually do that.
For instance, `accelerate estimate-memory` says training the Mistral-7B family with Adam takes roughly 55 GB with float16, which is more memory than a single 40 GB A100 has. So I'd need to use more than one GPU.
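For reference, that estimate comes from something like:

```bash
# ask accelerate to estimate training memory for the model in float16
accelerate estimate-memory mistralai/Mistral-7B-Instruct-v0.2 \
    --library_name transformers --dtypes float16
```

The ~55 GB figure is roughly consistent with a back-of-envelope 8 bytes/parameter for fp16 weights, gradients, and Adam's two moment buffers (7.2B params × 8 bytes ≈ 58 GB), before any activations or batch data.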
Would it be possible to modify the language_modeling documentation to explain how to do that?
Hi @csiefer2, thanks for opening this issue!
@muellerzr and @stevhliu are best placed to comment on this in general.
In the meantime, you can find some accelerate docs on distributed training here: https://huggingface.co/docs/transformers/en/accelerate#distributed-training-with--accelerate
Just launch the scripts with `accelerate launch` or `torchrun`; no need to do anything else.
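A minimal sketch (the model name and data files here are placeholders):

```bash
# data-parallel launch of the example script across 4 GPUs on one node
accelerate launch --num_processes 4 run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train.txt --do_train --output_dir out

# or, equivalently, with torchrun
torchrun --nproc-per-node 4 run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train.txt --do_train --output_dir out
```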
My attempts to do that have not been successful... run_clm seems happy to fill up the memory of however many GPUs I tell it to use and then die when it finally exceeds the memory limits. That's why I was asking the question :)
For instance, with the v4.41-release branch of transformers from GitHub, if I grab 4 A100s with 80 GB of memory each and do this:

```bash
torchrun --nproc-per-node 4 ./run_clm.py \
    --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 \
    --train_file=myfile1.txt --validation_file=myfile2.txt \
    --do_train --do_eval --output_dir=mydir --report_to none
```

run_clm.py runs itself out of memory... with a model whose gradients accelerate tells me should fit on 1-2 GPUs (depending on whether I use float32 or float16).
I get errors like:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 3 has a total capacity of 79.15 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 79.04 GiB memory in use. Of the allocated memory 77.23 GiB is allocated by PyTorch, and 348.16 MiB is reserved by PyTorch but unallocated.
```
Clearly I'm doing something wrong.
@amyeroberts That example doesn't use the `trainer.train()` function, which is what I'd (ideally) like to use.
@csiefer2 I can't comment on the memory calculation from accelerate (cc @muellerzr here) but I'm assuming this is just for the weights of the model + gradients on the forward/backward pass? You'll also need to account for the memory requirements of loading the data onto the GPU. What batch size are you using?
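One thing worth flagging: launching with plain `torchrun` gives you DistributedDataParallel, which keeps a full copy of the weights, gradients, and optimizer states on every GPU, so adding GPUs doesn't by itself shrink the per-GPU footprint. To actually shard a 7B model across the cards you'd want something like the Trainer's FSDP integration. A rough, untested sketch (flag values taken from the Trainer docs; adjust to taste):

```bash
# shard parameters, gradients and optimizer state across the 4 GPUs with FSDP
torchrun --nproc-per-node 4 ./run_clm.py \
    --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 \
    --train_file=myfile1.txt --validation_file=myfile2.txt \
    --do_train --do_eval --output_dir=mydir --report_to none \
    --bf16 \
    --fsdp "full_shard auto_wrap"
```

DeepSpeed ZeRO (via the Trainer's `--deepspeed` flag and a config file) is the other common route.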
@amyeroberts In the example above, I wasn't specifying it, but I've tried running with a batch size of 1 before and saw the same results. The training/evaluation data I used above is a whole 1.2 MB / 11 kB when stored on disk as text files, so I suspect this isn't a data-size issue.
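For completeness, the batch-size-1 attempt was essentially the earlier command with the standard TrainingArguments flags added (the `--block_size` value here is illustrative, another memory lever rather than something I've tuned):

```bash
torchrun --nproc-per-node 4 ./run_clm.py \
    --model_name_or_path=mistralai/Mistral-7B-Instruct-v0.2 \
    --train_file=myfile1.txt --validation_file=myfile2.txt \
    --do_train --do_eval --output_dir=mydir --report_to none \
    --per_device_train_batch_size 1 \
    --block_size 512   # cap the tokenized sequence length
```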
Thanks for the feedback!
I think keeping these two topics (language modeling and distributed training) in separate docs is better. It sounds like the issue is more about setting up for distributed training, and not so much language modeling. But we can improve on the distributed training docs with an example use case featuring language modeling.
@stevhliu That sounds perfectly reasonable. If you need someone to test out a revised training document to ensure that it works, I'd be happy to help!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
+1
Thanks @stevhliu!
Thanks for your patience! Working on redesigning the docs right now at https://github.com/huggingface/transformers/pull/31757 and I'll update the distributed training docs when I reach it 🙂
Bump!