GPU memory issue

Open · zhranj opened this issue 2 years ago • 2 comments

I used the train script from the README to kick off 125M model training for 10 batches, exactly as in the example, and was surprised to see it use almost all of my GPU memory (3080, 16 GB).

Here's the output from the last batch:

[batch=10/10]:
  Train time/batch: 9
  Train time/sample: 2304
  Train time/batch_in_epoch: 9
  Train time/sample_in_epoch: 2304
  Train time/token: 4718592
  Train time/token_in_epoch: 4718592
  Train memory/allocated_mem: 3.8951
  Train memory/active_mem: 3.8951
  Train memory/inactive_mem: 5.5924
  Train memory/reserved_mem: 13.5500
  Train memory/alloc_retries: 3
  Train trainer/device_train_microbatch_size: 8
  Train loss/train/total: 10.9654
  Train metrics/train/LanguageCrossEntropy: 10.9651
  Train metrics/train/LanguagePerplexity: 57818.7188
  Train time/train: 0.0606
  Train time/val: 0.0000
  Train time/total: 0.0606
  Train lr-DecoupledAdamW/group0: 0.0001
  Train time/remaining_estimate: 0.0000

Ideally I would like to train/fine-tune a larger model, but even the 350M model fails:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB (GPU 0; 16.00 GiB total capacity; 13.53 GiB already allocated; 6.00 MiB free; 14.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.

I checked with nvidia-smi before running the train script and the GPU had no memory allocated.

zhranj avatar May 12 '23 12:05 zhranj

Hi @zhranj, please try reducing device_train_microbatch_size. This keeps the math the same (within numerics) but reduces memory usage. The current settings have all been tuned for A100-40GB cards.

If you're unsure how to tune the microbatch size, try device_train_microbatch_size: auto and Composer will automatically catch OOMs and tune the microbatch size to fit on your hardware.
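
For example, assuming the 125M config lives at train/yamls/mpt/125m.yaml as in the README example, here is a minimal sketch of both approaches using the same key=value override style as the launch command (the values shown are illustrative, not tuned for your card):

  # Option A: pick a smaller fixed microbatch size
  composer train/train.py train/yamls/mpt/125m.yaml data_local=my-copy-c4 device_train_microbatch_size=4 max_duration=10ba

  # Option B: let Composer catch OOMs and shrink the microbatch automatically
  composer train/train.py train/yamls/mpt/125m.yaml data_local=my-copy-c4 device_train_microbatch_size=auto max_duration=10ba

Either way, global_train_batch_size stays the same, so only the gradient-accumulation split changes, not the optimization math.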

In general, using close-to-max GPU memory is not a bad thing; in fact, you should try to use as much as possible (i.e., as large a microbatch size as possible) to maximize throughput.

Let us know how this goes!

abhi-mosaic avatar May 12 '23 16:05 abhi-mosaic

Thanks for the answer @abhi-mosaic, this helped with the 350M model, but trying the 1b, 3b, or 7b configs now gives a different error that I'm not sure how to categorize:

root@8e801cb1b9dc:/workspaces/llm-foundry/scripts# composer train/train.py train/yamls/mpt/1b.yaml data_local=my-copy-c4 train_loader.dataset.split=train_small eval_loader.dataset.split=val_small max_duration=10ba eval_interval=0 save_folder=mpt-1b
Initializing model...
cfg.n_params=1.32e+09
Building train loader...
Building eval loader...
Building trainer...
/usr/lib/python3/dist-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manually overridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
Logging config...
data_local: my-copy-c4
data_remote: null
max_seq_len: 2048
global_seed: 17
run_name: llm
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 2048
  n_heads: 16
  n_layers: 24
  expansion_ratio: 4
  max_seq_len: ${max_seq_len}
  vocab_size: 50368
  attn_config:
    attn_impl: triton
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8
eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 10ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 512
seed: ${global_seed}
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_folder: mpt-1b
dist_timeout: 600.0
n_gpus: 1
device_train_batch_size: 512
device_train_grad_accum: 256
n_params: 1315950592

Starting training...


Config:
enabled_algorithms/GradientClipping: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 17


Traceback (most recent call last): File "", line 21, in _bwd_kernel KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, None, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('none', True, 128, False, True, True, True, 128, 128), (True, True, True, (False,), True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspaces/llm-foundry/scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "/workspaces/llm-foundry/scripts/train/train.py", line 243, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1940, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
    loss = closure()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch
    microbatch_loss.backward(create_graph=self._backwards_create_graph)
  File "/usr/lib/python3/dist-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/usr/lib/python3/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/lib/python3/dist-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/usr/lib/python3/dist-packages/flash_attn/flash_attn_triton.py", line 827, in backward
    _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv,
  File "/usr/lib/python3/dist-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward
    _bwd_kernel[grid](
  File "/usr/lib/python3/dist-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 73, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 73, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 63, in _bench
    return do_bench(kernel_call)
  File "/usr/lib/python3/dist-packages/triton/testing.py", line 140, in do_bench
    fn()
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 62, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 5236) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 5236) exited with code 1

zhranj avatar May 12 '23 19:05 zhranj

Hi @zhranj , this looks like an issue with Triton and your particular hardware.

Could you try these two options:

  1. Using our Docker image from the top of the README: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
  2. If it still fails, reverting attn_impl: triton to attn_impl: torch, which will just use PyTorch's attention; I would expect this to work on any NVIDIA GPU (see the sketch after this list).
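
For option 2, here is a minimal sketch of the change as a command-line override rather than a YAML edit; the nested key path comes from the config you logged above, and the remaining arguments mirror your original command:

  composer train/train.py train/yamls/mpt/1b.yaml model.attn_config.attn_impl=torch data_local=my-copy-c4 train_loader.dataset.split=train_small eval_loader.dataset.split=val_small max_duration=10ba eval_interval=0 save_folder=mpt-1b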

abhi-mosaic avatar May 18 '23 20:05 abhi-mosaic

Note:

Traceback (most recent call last):
File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, None, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('none', True, 128, False, True, True, True, 128, 128), (True, True, True, (False,), True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

means there is a mismatch between the Triton version required by llm-foundry and the Triton version installed on your system. As Abhi noted, this indicates your environment setup is incorrect.
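
If you want to double-check, a quick sanity check is to compare the Triton your environment actually imports against what your llm-foundry checkout pins (generic commands; I'm assuming the pin lives in setup.py of your checkout):

  # Triton version importable in the current environment
  python -c "import triton; print(triton.__version__)"
  # where that Triton package was installed from
  pip show triton
  # what llm-foundry expects (assumed location of the pin)
  grep -in triton setup.py

If those disagree, rebuilding the environment from the Docker image Abhi linked above is the easiest fix.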

vchiley avatar May 27 '23 19:05 vchiley