LESS Step 1 (warmup training): multi-GPU training

I want to train on multiple GPUs. Besides setting export header="torchrun --nproc_per_node 4 --nnodes 1 ..." and export CUDA_VISIBLE_DEVICES=4,5,6,7, is there anything else I need to set up? Right now my four GPUs, each with 24 GB of memory, still run out of memory. The model being trained is Llama2-7B-HF.

trainable params: 134,217,728 || all params: 6,872,641,536 || trainable%: 1.9529278123549145
[train set] examples: 13533; # avg tokens: 370.9773254394531
[train set] examples: 13533; # avg completion tokens: 105.39820861816406
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 183, in <module>
    main()
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 152, in main
    trainer = Trainer(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 3 has a total capacty of 23.67 GiB of which 111.62 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[The same traceback is printed three more times, ending in the same OutOfMemoryError for GPU 0, GPU 2, and GPU 1.]

[2024-11-03 07:03:40,851] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 991055) of binary: /mnt/users/ylu/anaconda3/envs/xwb_less/bin/python
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Oct 29 '24 14:10 xavierdawn
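
For context on the error above: the traceback shows every rank calling model.to(device) on its own GPU, so each of the four GPUs has to hold a full copy of the weights rather than a quarter of them. A rough back-of-the-envelope sketch follows, using the "all params" count from the log. The dtypes are assumptions, since the log does not say which precision the model is loaded in.

```python
# Rough estimate of the memory needed on EACH GPU just to hold the weights.
# ALL_PARAMS and GPU_TOTAL_GIB are copied from the log in the post above;
# the candidate dtypes are assumptions (the log does not state the dtype).
ALL_PARAMS = 6_872_641_536
GPU_TOTAL_GIB = 23.67
GIB = 1024 ** 3
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_gib(num_params: int, dtype: str) -> float:
    """Size of num_params parameters stored in dtype, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


for dtype in BYTES_PER_PARAM:
    size = weight_gib(ALL_PARAMS, dtype)
    verdict = "fits" if size < GPU_TOTAL_GIB else "does not fit"
    print(f"{dtype:>8}: {size:6.2f} GiB of weights, {verdict} in {GPU_TOTAL_GIB} GiB")
```

If the weights are loaded in float32, the estimate is already larger than the 23.67 GiB reported per GPU, before gradients, optimizer state, or activations are counted. In a 16-bit dtype the weights alone would fit.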

Have you already found a fix for this...?

Nov 25 '24 20:11 roanvanblanken

Same question here.

Dec 10 '24 13:12 QingyangZhang

same here:)

Jan 04 '25 09:01 Cooperzzy

@QingyangZhang maybe you can try the following:

reduce lora_r in base_training_args.sh
use a subset of the training data: point train_files in warmup_lora_train.sh at a smaller .jsonl file

Jan 04 '25 10:01 Cooperzzy
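
The second suggestion (training on a smaller .jsonl file) could be done with something like the sketch below. It assumes the training file has one JSON example per line; the function name write_jsonl_subset is made up for illustration and is not part of LESS.

```python
def write_jsonl_subset(src: str, dst: str, max_examples: int) -> int:
    """Copy the first max_examples non-empty lines of src into dst.

    Returns the number of examples written. Lines are copied as-is,
    so the JSON content of each example is left untouched.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if written >= max_examples:
                break
            if not line.strip():
                continue  # skip blank lines
            fout.write(line if line.endswith("\n") else line + "\n")
            written += 1
    return written
```

For example, write_jsonl_subset("train.jsonl", "train_small.jsonl", 1000) would keep the first 1000 examples (both file names are placeholders), and train_files in warmup_lora_train.sh would then point at the smaller file. Taking the first lines keeps the original order; a random sample may be preferable if the file is sorted.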

@QingyangZhang maybe you can try the following:

reduce lora_r in base_training_args.sh

use a subset of the training data: point train_files in warmup_lora_train.sh at a smaller .jsonl file

Do I only need to change lora_r? If possible, can you tell me what your lora_r setting is? When I ran it, the memory was already full while the weights were loading. Have you run into this as well?

`[INFO|modeling_utils.py:3341] >> loading weights file meta-llama/Llama-2-7b-hf/model.safetensors.index.json [INFO|configuration_utils.py:826] >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2 }

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]`

Mar 18 '25 08:03 alchowiw
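
On the lora_r question, a small sketch of how the adapter size scales with lora_r may help. It assumes LoRA is applied to the four attention projections (q_proj, k_proj, v_proj, o_proj) of every layer; that is an assumption, although with lora_r = 128 it reproduces the "trainable params: 134,217,728" line from the first post exactly. The hidden size (4096) and layer count (32) are those of Llama-2-7B.

```python
HIDDEN_SIZE = 4096        # Llama-2-7B hidden size
NUM_LAYERS = 32           # Llama-2-7B decoder layers
NUM_TARGET_MODULES = 4    # assumption: q_proj, k_proj, v_proj, o_proj

# From the log in the first post: all params minus trainable (LoRA) params.
FROZEN_PARAMS = 6_872_641_536 - 134_217_728


def lora_trainable_params(r: int) -> int:
    """LoRA parameter count: each adapted d x d layer adds A (r x d) and B (d x r)."""
    per_module = r * HIDDEN_SIZE + HIDDEN_SIZE * r
    return NUM_LAYERS * NUM_TARGET_MODULES * per_module


for r in (128, 64, 16, 8):
    trainable = lora_trainable_params(r)
    share = 100 * trainable / (FROZEN_PARAMS + trainable)
    print(f"lora_r={r:>3}: {trainable:>11,} trainable params ({share:.2f}% of all params)")
```

Under these assumptions, lowering lora_r shrinks the adapter (and with it the gradients and optimizer state kept for the adapter), but the frozen base weights stay the same size. In the first post the error is raised inside model.to(device), before any training step, so a smaller lora_r by itself may not be enough if the base weights do not fit on one GPU.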