llm-foundry
How to run on a V100 GPU
When I run composer train.py yamls/mpt/125m.yaml train_loader.dataset.split=train_small eval_loader.dataset.split=val_small, I get the error below. My GPU is a V100.
Traceback (most recent call last):
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 254, in
Hi @sysusicily , what versions of torch and CUDA are you using? And what precision?
I would recommend using the Docker image in the README.md, and using precision: amp_fp16 (since BF16 is not supported on V100). If you run into an OOM, you can try lowering the microbatch size or use device_train_microbatch_size: auto and let Composer tune it for you.
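For reference, here is a sketch of the version check and of the reported command with those overrides appended (the base command is the one above; the override syntax assumes the usual key=value form that train.py accepts):

# check which torch / CUDA versions are installed
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# original command with the suggested precision and microbatch overrides appended
composer train.py yamls/mpt/125m.yaml \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  precision=amp_fp16 \
  device_train_microbatch_size=auto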
CUDA 11.7 + PyTorch 1.13.1, and I have already used amp_fp16, but I still get this error.
If I change the attn_impl parameter from triton to torch, it runs on the V100 GPU. Does this mean that the V100 does not support Triton?
Ok, good to know... I would try attn_impl: flash as well to see if that works. But we haven't done much testing on V100s. I would have to check upstream to see whether the triton compiler supports V100 (and which version; we are pinned to triton==2.0.0.dev20221202).
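As a rough sketch, the attention implementation can be switched with the same override syntax; the exact key path (model.attn_impl vs. model.attn_config.attn_impl) depends on which llm-foundry version you have checked out, so verify it against your YAML:

# hedged sketch: select the attention implementation from the command line;
# the key path below is an assumption and may differ between versions
composer train.py yamls/mpt/125m.yaml \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  precision=amp_fp16 \
  model.attn_config.attn_impl=torch  # or flash / triton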
Hi @sysusicily, I'm fine-tuning with 6 V100 GPUs and have a question about training time. The fine-tuning process is extremely slow for me. I'm using fp16 and attn_impl: torch, with a global_train_batch_size of 12 and device_train_microbatch_size: auto, which resolves to a microbatch size of 2. Even after 15 hours, I haven't finished one-third of an epoch (500k rows of data). I would greatly appreciate it if you could share the size of your training data and how long the training took.
Hi @Louis-y-nlp, I can't guarantee high performance on V100 cards, as we mainly focus on A100s and H100s. To spot-check your throughput numbers, you can check out our throughput table here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/benchmarking. You can find throughputs for different model sizes, batch sizes, and cluster sizes, and the Throughput (T/s) column gives you the tokens per second.
I would expect V100s to be 2-3x slower than the A100 numbers you see, maybe 5x slower if you are using torch attention rather than triton.
@Louis-y-nlp something that just occurred to me: does your 6xV100 system have good GPU-GPU interconnect? We use FSDP to handle training of large models like MPT-7B, and it relies on NVSwitch between GPUs to shard model weights and move them around quickly.
If you do not have NVSwitch, I would expect training to go very slowly with FSDP. But also if you turn off FSDP, you probably cannot fit a 7B model + optimizer state on a single V100-[16GB or 32GB].
If this ends up being the issue, I believe you will have to move to a system with NVSwitch, or ideally an 8xA100-40GB node, to get good performance.
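For anyone debugging this locally, a quick way to inspect the GPU-GPU interconnect is the standard NVIDIA topology query; NV# entries in the matrix indicate NVLink/NVSwitch connections between GPU pairs, while PHB/SYS indicate slower PCIe/host paths:

# prints the GPU-to-GPU connectivity matrix for the machine
nvidia-smi topo -m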
Thank you for your response. I have switched to an 8xA100-40GB machine and it is running smoothly now.