./scripts/run_finetune.sh prints error "CUDA out of memory"
I use WSL2 with Ubuntu 22.04 for my environment. When I run ./scripts/run_finetune.sh it prints the above error, but I can successfully run ./scripts/run_finetune_with_lora.sh with both gpt2 and robin-7B. My graphics card is an RTX 3090 with 24 GB, and my system memory is 64 GB (maybe 32 GB in WSL). I think it may be a DeepSpeed bug; could you give me some advice? Thank you!
We did not test our code on WSL2. I would suggest using a Linux machine. If you are using Windows, you could try Google Colab. Thanks!
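If you do stay on WSL2, it may also be worth checking how much GPU and system memory is actually visible inside the guest, since WSL2 often gets less RAM than the Windows host. A quick sketch, assuming the NVIDIA WSL driver and standard Linux utilities are available:

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv   # GPU memory visible from inside WSL2
free -h                                                             # system RAM actually granted to the WSL2 VM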
Hi shizhediao,
I met the same issue when I tried to fine-tune LLaMA 7B. I even changed --block_size to 256. Could you please share a suggestion for this OOM?
Traceback (most recent call last):
File "/local2/mnt/workspace/xiaohuin/LMFlow/examples/finetune.py", line 62, in <module>
main()
File "/local2/mnt/workspace/xiaohuin/LMFlow/examples/finetune.py", line 58, in main
tuned_model = finetuner.tune(model=model, dataset=dataset)
File "/local2/mnt/workspace/xiaohuin/LMFlow/src/lmflow/pipeline/finetuner.py", line 285, in tune
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
return inner_training_loop(
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 297, in __init__
self._configure_distributed_model(model)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1146, in _configure_distributed_model
self.module.to(self.device)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 31.75 GiB total capacity; 18.31 GiB already allocated; 114.50 MiB free; 18.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-13 18:56:43,155] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17803
[2023-07-13 18:56:43,155] [ERROR] [launch.py:324:sigkill_handler] ['/root/anaconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/local2/mnt/workspace/xiaohuin/LMFlow/llama-7b-hf', '--dataset_path', '/local2/mnt/workspace/xiaohuin/LMFlow/data/alpaca/train', '--output_dir', '/local2/mnt/workspace/xiaohuin/LMFlow/output_models/finetune_with_lora', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '256', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', 'configs/ds_config_zero2.json', '--bf16=False', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
Here is my command. My GPU doesn't support bf16, and it only has 31.75 GiB of memory:
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${project_dir}/llama-7b-hf \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 1e-4 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --use_lora 1 \
    --lora_r 8 \
    --save_aggregated_lora 0 \
    --deepspeed configs/ds_config_zero2.json \
    --bf16=False \
    --run_name finetune_with_lora \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
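The error message itself suggests trying max_split_size_mb to avoid fragmentation, so one thing I may try next is setting that allocator option before re-running the same command. Just a sketch, and 128 is an arbitrary example value, not a tuned one:

# Cap the size of split blocks in the PyTorch CUDA caching allocator to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then re-run the deepspeed command above unchanged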
Hi, you can try a smaller model like openlm-research/open_llama_3b
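Concretely, that only means changing the model flag in the command above (a sketch; every other flag stays the same):

# in the deepspeed command, replace
#   --model_name_or_path ${project_dir}/llama-7b-hf
# with
#   --model_name_or_path openlm-research/open_llama_3b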
Thank you very much, it works.
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks