CUDA out of memory when training DPO in parallel on multiple GPUs
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
- LLaMA-Factory version: 0.9.2
- Platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.9
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.49.0
- Datasets version: 3.2.0
- Accelerate version: 1.2.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: Tesla V100-PCIE-16GB
- GPU number: 8
- GPU memory: 15.77GB
- DeepSpeed version: 0.16.7
- Bitsandbytes version: 0.45.5
Reproduction
Running DPO on the small Qwen2.5-0.5B model hits CUDA OOM no matter how I adjust the configuration. The relevant configs are below.

- LLaMA-Factory config:

```yaml
### model
model_name_or_path: /data1/zyy/Qwen2.5-0.5B-Address-SFT
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: full
pref_beta: 0.1
pref_loss: sigmoid
deepspeed: /data1/zyy/toolkit/LLaMA-Factory-0.9.2/examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: sft_comparison_20250418
template: qwen
cutoff_len: 4096
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 4
dataloader_num_workers: 1

### output
output_dir: /data1/zyy/data/dpo_training/outputs/20250418
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: false
ddp_timeout: 180000000
resume_from_checkpoint: null
```
- DeepSpeed config (ds_z3_offload_config.json):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e7,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e7,
    "stage3_max_reuse_distance": 1e7,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
Reasoning so far:
- CPU offload is enabled, and stage3_max_live_parameters and stage3_max_reuse_distance are already lowered to 1e7.
- Batch size and gradient accumulation steps are both lowered to 1, so the effective batch size is 1 * 1 * 8 (GPUs).
- Inputs are between 2048 and 4096 tokens long, so cutoff_len is raised to 4096.
- It is a single dataset with roughly 9800 records, so max_samples is set to 10000.
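For scale, a back-of-the-envelope estimate of the logits tensor alone shows why a 16 GB card struggles here. This is a sketch: the Qwen2.5 vocabulary size of 151,936 and the chosen/rejected concatenation in the DPO forward pass are assumptions, not figures taken from this log.

```python
# Back-of-the-envelope activation estimate for one DPO forward pass.
# Assumptions: Qwen2.5 vocab size is 151,936, and the DPO trainer
# concatenates chosen + rejected responses into a single forward.
vocab_size = 151_936
seq_len = 4_096        # cutoff_len
micro_batch = 1        # per_device_train_batch_size
pairs = 2              # chosen + rejected

logits_elems = micro_batch * pairs * seq_len * vocab_size
logits_gib = logits_elems * 4 / 2**30   # fp32, since bf16 is off on V100
print(f"logits alone: {logits_gib:.2f} GiB")   # ~4.64 GiB
```

`logits.log_softmax(-1)` then materializes a second tensor of the same size, and the same forward runs twice (policy and reference model), so peak activation memory dwarfs the 0.5B parameters. Note that ZeRO-3 offload only moves parameters and optimizer state to CPU, not these activations.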
Error message:

```text
{'loss': 0.5527, 'grad_norm': 10.518776893615723, 'learning_rate': 1.2396694214876035e-06, 'rewards/chosen': 0.3123237192630768, 'rewards/rejected': -0.02439103275537491, 'rewards/accuracies': 0.875, 'rewards/margins': 0.3367147445678711, 'logps/chosen': -8.385631561279297, 'logps/rejected': -0.42104580998420715, 'logits/chosen': -5.462330341339111, 'logits/rejected': -5.462770462036133, 'epoch': 0.07}
  3%|▎         | 92/3624 [04:07<2:38:07,  2.69s/it]
[rank5]: Traceback (most recent call last):
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/launcher.py", line 23, in <module>
[rank5]:     launch()
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/launcher.py", line 19, in launch
[rank5]:     run_exp()
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/tuner.py", line 103, in run_exp
[rank5]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/tuner.py", line 74, in _training_function
[rank5]:     run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/dpo/workflow.py", line 83, in run_dpo
[rank5]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank5]:   File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
[rank5]:     return inner_training_loop(
[rank5]:   File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank5]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank5]:   File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 3698, in training_step
[rank5]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/dpo/trainer.py", line 287, in compute_loss
[rank5]:     return super().compute_loss(model, inputs, return_outputs)
[rank5]:   File "/usr/local/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
[rank5]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/dpo/trainer.py", line 254, in get_batch_loss_metrics
[rank5]:     reference_chosen_logps, reference_rejected_logps = self.compute_reference_log_probs(model, batch)
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/dpo/trainer.py", line 231, in compute_reference_log_probs
[rank5]:     reference_chosen_logps, reference_rejected_logps, *_ = self.concatenated_forward(ref_model, batch)
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/dpo/trainer.py", line 199, in concatenated_forward
[rank5]:     all_logps, valid_length = get_batch_logps(logits=all_logits, labels=batch["labels"])
[rank5]:   File "/data1/zyy/toolkit/LLaMA-Factory-0.9.2/src/llamafactory/train/trainer_utils.py", line 562, in get_batch_logps
[rank5]:     per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank5]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.71 GiB. GPU 5 has a total capacity of 15.77 GiB of which 2.46 GiB is free. Process 20225 has 13.31 GiB memory in use. Of the allocated memory 11.83 GiB is allocated by PyTorch, and 926.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
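The failing line computes `logits.log_softmax(-1)` over the full `(batch, seq_len, vocab)` tensor before gathering one value per token, which allocates a second vocab-sized tensor. Below is a hedged sketch of a chunked alternative, not LLaMA-Factory's actual code; `chunk_size` is a made-up knob.

```python
import torch

def chunked_per_token_logps(logits: torch.Tensor,
                            labels: torch.Tensor,
                            chunk_size: int = 512) -> torch.Tensor:
    """Gather per-token log-probs without materializing the full
    log-softmax tensor. log_softmax is independent per position, so
    chunking along the sequence dimension is exact, not approximate."""
    pieces = []
    for start in range(0, logits.size(1), chunk_size):
        chunk = logits[:, start:start + chunk_size, :].log_softmax(-1)
        idx = labels[:, start:start + chunk_size].unsqueeze(2)
        pieces.append(torch.gather(chunk, dim=2, index=idx).squeeze(2))
    return torch.cat(pieces, dim=1)
```

This returns the same values as `torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)` while capping the temporary log-softmax buffer at `chunk_size / seq_len` of its original size.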
Others
I found a previous issue: https://github.com/hiyouga/LLaMA-Factory/issues/6800
Adjusting things along the lines suggested there did not solve the problem (the V100 mentioned in that issue is the 32 GB variant).
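Independent of that issue, the allocator hint printed in the OOM message itself is cheap to rule out first (it only mitigates fragmentation; it will not help a genuine capacity shortfall):

```shell
# Suggested by the PyTorch OOM message: let the caching allocator grow
# segments instead of fragmenting fixed-size ones. Set before launching
# the training command.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```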
Marking this thread. I have also been running DPO training experiments recently, and my impression is that ZeRO-3 offload does not take effect during DPO training, i.e. nothing is actually offloaded to CPU. I'm not sure whether that is really the case.
Try turning off DeepSpeed and enabling pure_bf16.
The V100 does not support BF16.
Use LoRA.
I'd also like to ask about this: I tried both DPO full and DPO LoRA, and the GPU memory usage seemed about the same. Is my YAML script wrong?
dpo_lora.yaml
```yaml
### model
model_name_or_path: /opt/ml/pretrain_model
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 32
lora_dropout: 0.05
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: labeled_dpo_0422_819
template: qwen
cutoff_len: 7000
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 64
dataloader_num_workers: 4

### output
output_dir: /opt/ml/output/data
logging_steps: 5
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: false
```
Use LoRA.
The LoRA results didn't feel great, so...
Latest update: I tried distributed fine-tuning across two machines, each with 8× V100 16 GB, and it still OOMs. So it seems even batch size = 1 does not fit.
Use LoRA.
I've switched to LoRA now. Fine-tuning the 0.5B model with the default lora_rank=8, the adapted parameters are: trainable params: 4,399,104 || all params: 498,431,872 || trainable%: 0.8826. What ratio is generally better?
You can set the rank to 16.
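Since LoRA adds roughly r × (d_in + d_out) parameters per adapted matrix, the trainable count scales linearly with the rank. A sketch plugging in the rank-8 numbers quoted above, assuming adapters are attached to the same target modules:

```python
# Trainable parameter count scales linearly with lora_rank; the numbers
# below are the rank-8 figures reported for Qwen2.5-0.5B in this thread.
rank8_trainable = 4_399_104
rank8_all = 498_431_872
base_params = rank8_all - rank8_trainable      # frozen base model

rank16_trainable = rank8_trainable * 16 // 8   # linear in rank
pct = 100 * rank16_trainable / (base_params + rank16_trainable)
print(rank16_trainable)            # 8798208
print(f"trainable%: {pct:.4f}")    # ~1.7497
```

So rank 16 still trains well under 2% of the parameters; GPU memory is dominated by the frozen base model and activations either way, which is consistent with full and LoRA DPO showing similar usage above.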