./scripts/run_finetune.sh prints error "CUDA out of memory"
I use WSL2 with Ubuntu 22.04 for my environment. When I run ./scripts/run_finetune.sh it prints the above error, but I can successfully run ./scripts/run_finetune_with_lora.sh with both gpt2 and robin-7B. My graphics card is an RTX 3090 with 24 GB, and my system memory is 64 GB (maybe 32 GB in WSL). I think it may be a DeepSpeed bug; could you give me some advice? Thank you!
We did not test our code on WSL2. I would suggest using a Linux machine. If you are using Windows, you could try Google Colab. Thanks!
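If you do stay on WSL2, it may also be worth checking how much GPU and system memory is actually visible inside the guest, since WSL2 often gets less RAM than the Windows host. A quick sketch, assuming the NVIDIA WSL driver and standard Linux utilities are available:

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv   # GPU memory visible from inside WSL2
free -h                                                             # system RAM actually granted to the WSL2 VM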
Hi shizhediao,
I met the same issue when I tried to fine-tune LLaMA 7B. I even changed --block_size to 256. Could you please share a suggestion for this OOM?
Traceback (most recent call last):
File "/local2/mnt/workspace/xiaohuin/LMFlow/examples/finetune.py", line 62, in <module>
main()
File "/local2/mnt/workspace/xiaohuin/LMFlow/examples/finetune.py", line 58, in main
tuned_model = finetuner.tune(model=model, dataset=dataset)
File "/local2/mnt/workspace/xiaohuin/LMFlow/src/lmflow/pipeline/finetuner.py", line 285, in tune
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
return inner_training_loop(
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 297, in __init__
self._configure_distributed_model(model)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1146, in _configure_distributed_model
self.module.to(self.device)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 31.75 GiB total capacity; 18.31 GiB already allocated; 114.50 MiB free; 18.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-13 18:56:43,155] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17803
[2023-07-13 18:56:43,155] [ERROR] [launch.py:324:sigkill_handler] ['/root/anaconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/local2/mnt/workspace/xiaohuin/LMFlow/llama-7b-hf', '--dataset_path', '/local2/mnt/workspace/xiaohuin/LMFlow/data/alpaca/train', '--output_dir', '/local2/mnt/workspace/xiaohuin/LMFlow/output_models/finetune_with_lora', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '256', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', 'configs/ds_config_zero2.json', '--bf16=False', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
Here is my command. My GPU doesn't support bf16, and it only has 31.75 GiB of memory:
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${project_dir}/llama-7b-hf \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 1e-4 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --use_lora 1 \
    --lora_r 8 \
    --save_aggregated_lora 0 \
    --deepspeed configs/ds_config_zero2.json \
    --bf16=False \
    --run_name finetune_with_lora \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
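The error message itself suggests trying max_split_size_mb to avoid fragmentation, so one thing I may try next is setting that allocator option before re-running the same command. Just a sketch, and 128 is an arbitrary example value, not a tuned one:

# Cap the size of split blocks in the PyTorch CUDA caching allocator to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then re-run the deepspeed command above unchanged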
Hi, you can try a smaller model like openlm-research/open_llama_3b
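Concretely, that only means changing the model flag in the command above (a sketch; every other flag stays the same):

# in the deepspeed command, replace
#   --model_name_or_path ${project_dir}/llama-7b-hf
# with
#   --model_name_or_path openlm-research/open_llama_3b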
Thank you very much, it works.
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks