How to run on a V100 GPU

Open sysusicily opened this issue 2 years ago • 5 comments

When I run "composer train.py yamls/mpt/125m.yaml train_loader.dataset.split=train_small eval_loader.dataset.split=val_small", I get the error below. My GPU is a V100.


Traceback (most recent call last): File "", line 21, in _bwd_kernel KeyError: ('2-.-0-.-0-1e8410f206c822547fb50e2ea86e45a6-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-42648570729a4835b21c1c18cebedbfe-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, None, torch.float16, torch.float32, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('none', True, 64, False, True, True, True, 128, 128), (True, True, True, (False,), True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "train.py", line 254, in main(cfg) File "train.py", line 243, in main trainer.fit() File "/root/mpt-env/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1766, in fit self._train_loop() File "/root/mpt-env/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop total_loss_dict = self._train_batch(use_grad_scaling) File "/root/mpt-env/lib/python3.8/site-packages/composer/trainer/trainer.py", line 2118, in _train_batch self._train_microbatches(microbatches, total_loss_dict) File "/root/mpt-env/lib/python3.8/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch) File "/root/mpt-env/lib/python3.8/site-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch microbatch_loss.backward(create_graph=self._backwards_create_graph) File "/root/mpt-env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward torch.autograd.backward( File "/root/mpt-env/lib/python3.8/site-packages/torch/autograd/init.py", line 197, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/root/mpt-env/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply return user_fn(self, *args) File "/root/mpt-env/lib/python3.8/site-packages/flash_attn/flash_attn_triton.py", line 827, in backward _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, File "/root/mpt-env/lib/python3.8/site-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward _bwd_kernel[grid]( File "/root/mpt-env/lib/python3.8/site-packages/triton/runtime/jit.py", line 106, in launcher return self.run(*args, grid=grid, **kwargs) File "/root/mpt-env/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 73, in run timings = {config: self._bench(*args, config=config, **kwargs) File "/root/mpt-env/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 73, in timings = {config: self._bench(*args, config=config, **kwargs) File "/root/mpt-env/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 63, in _bench return do_bench(kernel_call) File "/root/mpt-env/lib/python3.8/site-packages/triton/testing.py", line 140, in do_bench fn() File "/root/mpt-env/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current) File "/root/mpt-env/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 200, in run return self.fn.run(*args, **kwargs) File "", line 43, in _bwd_kernel RuntimeError: Triton Error [CUDA]: invalid argument ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately. Global rank 0 (PID 6259) exited with code 1 ERROR:composer.cli.launcher:Global rank 0 (PID 6259) exited with code 1

sysusicily avatar May 12 '23 03:05 sysusicily

Hi @sysusicily , what versions of torch and CUDA are you using? And what precision?

I would recommend using the Docker image in the README.md, and using precision: amp_fp16 (since BF16 is not supported on V100). If you run into an OOM, you can try lowering the microbatch size or use device_train_microbatch_size: auto and let Composer tune it for you.
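For reference, you can print the versions in your environment with:

python -c "import torch; print(torch.__version__, torch.version.cuda)"

And since train.py accepts the same dotted command-line overrides you already used for the dataset splits, here is a minimal sketch of the settings above (assuming your config uses the stock 125m.yaml key names):

composer train.py yamls/mpt/125m.yaml \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  precision=amp_fp16 \
  device_train_microbatch_size=auto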

abhi-mosaic avatar May 12 '23 03:05 abhi-mosaic

CUDA 11.7 + PyTorch 1.13.1, and I have already used amp_fp16, but I still get this error.

sysusicily avatar May 12 '23 04:05 sysusicily

If I change the attn_impl parameter from triton to torch, it runs on the V100 GPU. Does this mean that the V100 does not support Triton?

sysusicily avatar May 12 '23 06:05 sysusicily

Ok, good to know... I would try attn_impl: flash as well to see if that works, but we haven't done much testing on V100s. I would have to check upstream whether the Triton compiler supports V100 (and which version, since we are pinned to triton==2.0.0.dev20221202).
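If you want to try it from the command line, a sketch of the override (the exact key depends on your llm-foundry version, so check your yaml; in the stock MPT yamls it is nested under model.attn_config):

composer train.py yamls/mpt/125m.yaml \
  model.attn_config.attn_impl=flash \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small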

abhi-mosaic avatar May 12 '23 17:05 abhi-mosaic

Hi @sysusicily, I'm fine-tuning with 6 V100 GPUs and have a question about the training time. The fine-tuning process is extremely slow for me. I'm using fp16 and attn_impl: torch, with a global_train_batch_size of 12 and device_train_microbatch_size: auto (which resolved to 2). Even after 15 hours, I haven't finished one-third of an epoch (500k rows of data). I would greatly appreciate it if you could share the size of your training data and how long the training took.

Louis-y-nlp avatar May 17 '23 08:05 Louis-y-nlp

Hi @Louis-y-nlp , I can't guarantee high performance on V100 cards as we mainly focus on A100+H100s. To spot check your throughput numbers, you can check out our throughput table here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/benchmarking. You can find throughputs for different model sizes, batch sizes, and cluster sizes, and the column Throughput (T/s) will tell you the tokens_per_second.

I would expect V100s to be 2-3x slower than the A100 numbers you see, maybe 5x slower if you are using torch attention rather than triton.
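As a rough sanity check with made-up numbers (substitute your own): if your 500k rows average ~2048 tokens each, that is roughly 1B tokens per epoch, so at an overall throughput of 10,000 tokens/s you would need about 1e9 / 1e4 = 100,000 seconds, i.e. a bit over a day per epoch. Comparing that kind of estimate against the tokens/s reported in your logs should tell you whether something is misconfigured or the hardware is just slow.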

abhi-mosaic avatar May 31 '23 00:05 abhi-mosaic

@Louis-y-nlp something that just occurred to me: does your 6xV100 system have good GPU-GPU interconnect? We use FSDP to handle training of large models like MPT-7B, and it relies on NVSwitch between GPUs to shard model weights and move them around quickly.
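One quick way to check is to run:

nvidia-smi topo -m

In the topology matrix, NV# entries between GPU pairs indicate NVLink/NVSwitch connections, while PHB/NODE/SYS mean the traffic goes over PCIe and/or across CPU sockets.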

If you do not have NVSwitch, I would expect training to go very slowly with FSDP. But also if you turn off FSDP, you probably cannot fit a 7B model + optimizer state on a single V100-[16GB or 32GB].

If this ends up being the issue, I believe you will have to move to a system with NVSwitch, or ideally an 8xA100-40GB node, to get good performance.

abhi-mosaic avatar May 31 '23 01:05 abhi-mosaic

Thank you for your response. I have switched to the 8xA100 40G machine and it is running smoothly now.

Louis-y-nlp avatar May 31 '23 02:05 Louis-y-nlp