llm-foundry
GPU memory issue
I used the train script from the README to kick off a 125M model training run for 10 batches, exactly as in the example, and was surprised to see it use almost all of my GPU memory (3080, 16GB).
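For reference, this is roughly the command I ran, reconstructed from the README; the 125m.yaml path and the exact overrides are assumed to mirror the 1B invocation pasted further down in this thread:

```sh
# 125M pretraining example from the README (approximate reconstruction)
composer train/train.py train/yamls/mpt/125m.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m
```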
Here's the output from the last batch:
[batch=10/10]:
  Train time/batch: 9
  Train time/sample: 2304
  Train time/batch_in_epoch: 9
  Train time/sample_in_epoch: 2304
  Train time/token: 4718592
  Train time/token_in_epoch: 4718592
  Train memory/allocated_mem: 3.8951
  Train memory/active_mem: 3.8951
  Train memory/inactive_mem: 5.5924
  Train memory/reserved_mem: 13.5500
  Train memory/alloc_retries: 3
  Train trainer/device_train_microbatch_size: 8
  Train loss/train/total: 10.9654
  Train metrics/train/LanguageCrossEntropy: 10.9651
  Train metrics/train/LanguagePerplexity: 57818.7188
  Train time/train: 0.0606
  Train time/val: 0.0000
  Train time/total: 0.0606
  Train lr-DecoupledAdamW/group0: 0.0001
  Train time/remaining_estimate: 0.0000
I would ideally want to train/fine-tune a larger model, but even trying the 350M model fails:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB (GPU 0; 16.00 GiB total capacity; 13.53 GiB already allocated; 6.00 MiB free; 14.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
I checked with nvidia-smi before running the train script and the GPU had no memory allocated.
Hi @zhranj, please try reducing device_train_microbatch_size. This will keep the math the same (within numerics) but will reduce memory usage. The current settings have all been tuned for A100-40GB cards.
If you're unsure how to tune the microbatch size, try device_train_microbatch_size: auto and Composer will automatically catch OOMs and tune the microbatch size to fit on your hardware.
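For example, both options can be passed as command-line overrides instead of editing the YAML. A sketch; the 350m.yaml path is assumed to follow the same layout as the other configs under train/yamls/mpt/, and the value in the first option is only illustrative:

```sh
# Option 1: cap the per-device microbatch size explicitly
composer train/train.py train/yamls/mpt/350m.yaml \
  data_local=my-copy-c4 \
  device_train_microbatch_size=4

# Option 2: let Composer catch OOMs and find the largest microbatch that fits
composer train/train.py train/yamls/mpt/350m.yaml \
  data_local=my-copy-c4 \
  device_train_microbatch_size=auto
```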
In general, using close-to-max GPU memory is not a bad thing; in fact, you should try to use as much as possible (i.e. as large a microbatch size as possible) to maximize throughput.
Let us know how this goes!
Thanks for the answer @abhi-mosaic, this helped with the 350M model, but trying 1b, 3b, or 7b now gives a different error which I'm not sure how to categorize:
root@8e801cb1b9dc:/workspaces/llm-foundry/scripts# composer train/train.py train/yamls/mpt/1b.yaml data_local=my-copy-c4 train_loader.dataset.split=train_small eval_loader.dataset.split=val_small max_duration=10ba eval_interval=0 save_folder=mpt-1b
Initializing model...
cfg.n_params=1.32e+09
Building train loader...
Building eval loader...
Building trainer...
/usr/lib/python3/dist-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manually overridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
Logging config...
data_local: my-copy-c4
data_remote: null
max_seq_len: 2048
global_seed: 17
run_name: llm
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 2048
  n_heads: 16
  n_layers: 24
  expansion_ratio: 4
  max_seq_len: ${max_seq_len}
  vocab_size: 50368
  attn_config:
    attn_impl: triton
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8
eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 10ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 512
seed: ${global_seed}
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_folder: mpt-1b
dist_timeout: 600.0
n_gpus: 1
device_train_batch_size: 512
device_train_grad_accum: 256
n_params: 1315950592
Starting training...
Config:
enabled_algorithms/GradientClipping: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 17
Traceback (most recent call last):
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspaces/llm-foundry/scripts/train/train.py", line 254, in
Hi @zhranj, this looks like an issue with Triton and your particular hardware.
Could you try these two options:
- using our Docker image from the top README: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
- if it still fails, reverting attn_impl: triton to attn_impl: torch, which will just use PyTorch's attention, and I would expect that to work on any NVIDIA GPU (see the example command after this list).
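For the second option, the switch can be passed as a command-line override rather than by editing the YAML; a sketch, reusing the same 1b.yaml invocation from above:

```sh
# Fall back to the plain PyTorch attention implementation instead of Triton
composer train/train.py train/yamls/mpt/1b.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-1b \
  model.attn_config.attn_impl=torch
```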
Note:
Traceback (most recent call last):
File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, None, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('none', True, 128, False, True, True, True, 128, 128), (True, True, True, (False,), True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))
means there is a mismatch between the triton version required by llm-foundry and the triton version installed on your system. As Abhi noted, this means the environment setup is incorrect.
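A quick way to confirm this kind of mismatch is to compare what llm-foundry pins against what the environment actually provides. A minimal sketch; it assumes the repo checkout lives at /workspaces/llm-foundry as in the traceback above:

```sh
# What the repo expects: look for triton in setup.py's dependency lists
grep -i triton /workspaces/llm-foundry/setup.py

# What is actually installed in this environment
pip show triton
python -c "import triton; print(triton.__version__)"
```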