
CUDA OOM when finetuning meta-llama/Meta-Llama-3-8B-Instruct

Open · zhj2022 opened this issue 1 year ago · 1 comment

I was trying to fine-tune Meta-Llama-3-8B-Instruct on 4 GPUs with the following command:

torchrun --nproc_per_node 4 -m training.run --output_dir llama3test --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --train_data training/toy_data --learning_rate 1e-5 --num_train_epochs 5 --per_device_train_batch_size 1 --dataloader_drop_last True --normalized True --temperature 0.02 --query_max_len 32 --passage_max_len 128 --train_group_size 2 --mode unified --attn cccc --attn_implementation sdpa --no_gen_gas --no_emb_gas --split_emb --bf16

All 4 GPUs then ran out of memory. Here is the full log:

W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] ***************************************** W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] ***************************************** 11/20/2024 13:47:23 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False 11/20/2024 13:47:23 - INFO - __main__ - Training/evaluation parameters CustomTrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=False, emb_p_only=False, emb_q_only=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=1e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=llama3test/runs/Nov20_13-47-23_u, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=500, logging_strategy=steps, lora=False, loss_gen_factor=1.0, loss_gen_type=mixed, lr_scheduler_kwargs={}, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mode=unified, mp_parameters=, neftune_noise_alpha=None, negatives_cross_device=False, no_cuda=False, no_emb_gas=True, no_gen_gas=True, num_train_epochs=5.0, optim=adamw_torch, optim_args=None, output_dir=llama3test, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_generative_bs=None, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, qlora=False, ray_scope=last, remove_unused_columns=True, report_to=['wandb'], resume_from_checkpoint=None, run_name=llama3test, save_on_each_node=False, save_only_model=False, save_safetensors=False, save_steps=500, save_strategy=steps, save_total_limit=None, seed=42, skip_memory_metrics=True, split_batches=False, split_emb=True, split_emb_full=False, temperature=0.02, tf32=None, 
torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) 11/20/2024 13:47:23 - INFO - __main__ - Model parameters ModelArguments(model_name_or_path='meta-llama/Meta-Llama-3-8B-Instruct', config_name=None, tokenizer_name=None, pooling_method='weightedmean', normalized=True, attn_implementation='sdpa', attn='cccc', projection=None) 11/20/2024 13:47:23 - INFO - __main__ - Data parameters DataArguments(train_data='training/toy_data', train_group_size=2, query_max_len=32, passage_max_len=128, generative_max_len=None, max_example_num_per_dataset=100000000, num_samples=None, use_unique_indices=False, prefixlm=False) 11/20/2024 13:47:23 - INFO - __main__ - Using GradCache with chunk size 1 /home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_downloadis deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, useforce_download=True. warnings.warn( 11/20/2024 13:47:24 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False /home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_downloadis deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, useforce_download=True. warnings.warn( 11/20/2024 13:47:24 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False 11/20/2024 13:47:24 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False /home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_downloadis deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, useforce_download=True. warnings.warn( /home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_downloadis deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, useforce_download=True`. warnings.warn( Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 11/20/2024 13:47:24 - INFO - main - Config: LlamaConfig { "_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128009, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "intermediate_size": 14336, "label2id": { "LABEL_0": 0 }, "max_position_embeddings": 8192, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.37.2", "use_cache": true, "vocab_size": 128256 }

11/20/2024 13:47:24 - INFO - main - Set pad token to bos token: <|begin_of_text|> 11/20/2024 13:47:24 - INFO - main - Loading dataset training/toy_data/toy_data_generative.jsonl Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 11/20/2024 13:47:25 - INFO - main - Loading dataset training/toy_data/toy_data_embedding.jsonl 11/20/2024 13:47:26 - INFO - main - Filtering out embedding samples with too long instructions for training/toy_data/toy_data_embedding.jsonl 11/20/2024 13:47:26 - INFO - main - Unified mode: 10 embedding samples, 10 generative samples Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.85it/s] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.50it/s] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.31it/s] Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.44it/s] Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn 11/20/2024 13:47:32 - INFO - main - Starting training 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - ***** Running training ***** 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Num examples = 10 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Num Epochs = 5 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Instantaneous batch size per device = 1 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Total train batch size (w. parallel, distributed & accumulation) = 4 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Gradient Accumulation steps = 1 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Total optimization steps = 10 11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Number of trainable parameters = 8,030,261,248 wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Currently logged in as: hongjizhang183 (hongjizhang183-shanghai-jiao-tong-university). Use wandb login --relogin to force relogin wandb: Tracking run with wandb version 0.18.7 wandb: Run data is saved locally in /home/hongjizhang/gritlm/gritlm/wandb/run-20241120_134735-vujcrvoq wandb: Run wandb offline to turn off syncing. wandb: Syncing run leafy-moon-48 wandb: ⭐️ View project at https://wandb.ai/hongjizhang183-shanghai-jiao-tong-university/huggingface wandb: 🚀 View run at https://wandb.ai/hongjizhang183-shanghai-jiao-tong-university/huggingface/runs/vujcrvoq 0%| | 0/10 [00:00<?, ?it/s][rank1]:[W1120 13:47:36.928805439 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. 
This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank2]:[W1120 13:47:36.929845239 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank0]:[W1120 13:47:36.936399679 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank3]:[W1120 13:47:36.938326946 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) [rank2]: Traceback (most recent call last): [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main [rank2]: return _run_code(code, main_globals, None, [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code [rank2]: exec(code, run_globals) [rank2]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in [rank2]: main() [rank2]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main [rank2]: trainer.train() [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train [rank2]: return inner_training_loop( [rank2]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop [rank2]: self.optimizer.step() [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step [rank2]: self.optimizer.step(closure) [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper [rank2]: return func.get(opt, opt.class)(*args, **kwargs) [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper [rank2]: out = func(*args, **kwargs) [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad [rank2]: ret = func(self, *args, **kwargs) [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step [rank2]: adamw( [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback [rank2]: return func(*args, **kwargs) [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw [rank2]: func( [rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw [rank2]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs) [rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 2 has a total capacity of 79.33 GiB of which 22.00 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 77.58 GiB is allocated by PyTorch, and 501.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) Traceback (most recent call last): File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in main() File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main trainer.train() File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop self.optimizer.step() File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step self.optimizer.step(closure) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper return func.get(opt, opt.class)(*args, **kwargs) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper out = func(*args, **kwargs) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad ret = func(self, *args, **kwargs) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step adamw( File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback return func(*args, **kwargs) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw func( File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 12.00 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 77.72 GiB is allocated by PyTorch, and 463.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank0]: Traceback (most recent call last): [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main [rank0]: return _run_code(code, main_globals, None, [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code [rank0]: exec(code, run_globals) [rank0]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in [rank0]: main() [rank0]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main [rank0]: trainer.train() [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train [rank0]: return inner_training_loop( [rank0]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop [rank0]: self.optimizer.step() [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step [rank0]: self.optimizer.step(closure) [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper [rank0]: return func.get(opt, opt.class)(*args, **kwargs) [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper [rank0]: out = func(*args, **kwargs) [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad [rank0]: ret = func(self, *args, **kwargs) [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step [rank0]: adamw( [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback [rank0]: return func(*args, **kwargs) [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw [rank0]: func( [rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw [rank0]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs) [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 12.00 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 77.72 GiB is allocated by PyTorch, and 463.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank1]: Traceback (most recent call last): [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main [rank1]: return _run_code(code, main_globals, None, [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code [rank1]: exec(code, run_globals) [rank1]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in [rank1]: main() [rank1]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main [rank1]: trainer.train() [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train [rank1]: return inner_training_loop( [rank1]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop [rank1]: self.optimizer.step() [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step [rank1]: self.optimizer.step(closure) [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper [rank1]: return func.get(opt, opt.class)(*args, **kwargs) [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper [rank1]: out = func(*args, **kwargs) [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad [rank1]: ret = func(self, *args, **kwargs) [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step [rank1]: adamw( [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback [rank1]: return func(*args, **kwargs) [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw [rank1]: func( [rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw [rank1]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs) [rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacity of 79.33 GiB of which 38.00 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 77.61 GiB is allocated by PyTorch, and 453.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank3]: Traceback (most recent call last): [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main [rank3]: return _run_code(code, main_globals, None, [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code [rank3]: exec(code, run_globals) [rank3]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in [rank3]: main() [rank3]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main [rank3]: trainer.train() [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train [rank3]: return inner_training_loop( [rank3]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop [rank3]: self.optimizer.step() [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step [rank3]: self.optimizer.step(closure) [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper [rank3]: return func.get(opt, opt.class)(*args, **kwargs) [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper [rank3]: out = func(*args, **kwargs) [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad [rank3]: ret = func(self, *args, **kwargs) [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step [rank3]: adamw( [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback [rank3]: return func(*args, **kwargs) [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw [rank3]: func( [rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw [rank3]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs) [rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 3 has a total capacity of 79.33 GiB of which 34.00 MiB is free. Including non-PyTorch memory, this process has 79.29 GiB memory in use. Of the allocated memory 77.72 GiB is allocated by PyTorch, and 442.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) W1120 13:47:39.700000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2742332 closing signal SIGTERM W1120 13:47:39.701000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2742333 closing signal SIGTERM W1120 13:47:39.702000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2742334 closing signal SIGTERM wandb: 🚀 View run leafy-moon-48 at: https://wandb.ai/hongjizhang183-shanghai-jiao-tong-university/huggingface/runs/vujcrvoq wandb: Find logs at: wandb/run-20241120_134735-vujcrvoq/logs E1120 13:47:40.167000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 2742335) of binary: /home/hongjizhang/.conda/envs/gritlm/bin/python Traceback (most recent call last): File "/home/hongjizhang/.conda/envs/gritlm/bin/torchrun", line 8, in sys.exit(main()) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper return f(*args, **kwargs) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

training.run FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-11-20_13:47:39
  host      : u
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2742335)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The GPUs are NVIDIA A800-SXM4-80GB, which in principle should not run out of memory when loading an 8B model in bf16 precision. I don't understand why a Llama-3-8B model takes up so much memory during training.
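For what it's worth, here is a rough back-of-the-envelope budget I put together (my own estimate, not anything from the repo): with plain DDP every rank keeps a full copy of the weights, the gradients, and the two AdamW moment buffers (torch.optim.AdamW allocates them with zeros_like, so they inherit the parameter dtype), and the multi-tensor AdamW path in the traceback above additionally materialises a sqrt of the second moment (torch._foreach_sqrt) during the first optimizer step.

```python
# Rough per-GPU memory budget for full finetuning with plain DDP, where every
# rank holds a full copy of weights, gradients and optimizer state.
# These are my own back-of-the-envelope assumptions, not measurements.
GiB = 1024 ** 3
N = 8_030_261_248  # "Number of trainable parameters" reported in the log above

def budget(param_bytes, grad_bytes, state_bytes):
    weights = N * param_bytes / GiB          # model weights
    grads = N * grad_bytes / GiB             # gradients (one full copy per rank)
    moments = 2 * N * state_bytes / GiB      # AdamW exp_avg + exp_avg_sq
    sqrt_tmp = N * state_bytes / GiB         # temporary from torch._foreach_sqrt(exp_avg_sq)
    total = weights + grads + moments + sqrt_tmp
    print(f"weights {weights:5.1f} | grads {grads:5.1f} | adam {moments:5.1f} | "
          f"sqrt tmp {sqrt_tmp:5.1f} | total {total:5.1f} GiB (+ activations, CUDA context)")

budget(2, 2, 2)  # everything in bf16    -> ~75 GiB
budget(2, 2, 4)  # fp32 optimizer state  -> ~120 GiB
```

Even in the all-bf16 case that is already around 75 GiB before activations, GradCache buffers and the CUDA context, which roughly matches the ~77.6 GiB the allocator reports at the failing step. So full finetuning of an 8B model on 80 GB cards seems to be right at the edge; presumably it would need something like gradient checkpointing, a sharded or 8-bit optimizer, or the lora / qlora options visible in the training arguments above to fit comfortably.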

zhj2022 · Nov 20 '24 06:11