
Uneven GPU memory distribution during DPO

Open Arcmoon-Hu opened this issue 11 months ago • 1 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.48.3
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800 80GB PCIe
  • DeepSpeed version: 0.15.1
  • vLLM version: 0.8.2

Reproduction

14B model, single node with 8× A800, full-parameter fine-tuning, batch size can only be set to 1, using ds_z3_offload. Problems encountered so far:
1. Training crashes with OOM after a certain number of steps.
2. However, for SFT with ds_z3, batch_size can be set as high as 4.
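For context, a minimal sketch of a ds_z3_offload-style DeepSpeed config. The keys under `zero_optimization` are real DeepSpeed options; the specific values below are illustrative assumptions, not the config actually used in this issue:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Lowering `stage3_max_live_parameters`, `stage3_prefetch_bucket_size`, and `reduce_bucket_size` trades some throughput for a lower, more predictable per-GPU memory peak, which is the usual first knob to try for late-step OOMs under ZeRO-3.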

Image

Others

No response

Arcmoon-Hu avatar Apr 22 '25 12:04 Arcmoon-Hu

Same problem here. LoRA SFT on Qwen2.5VL-7B-Instruct suddenly goes out of memory after a certain number of steps; batch size and gradient_accumulation_steps can both only be set to 1 on 8× A800 80GB. During training, one GPU's memory usage suddenly spikes and it OOMs:

  • GPU 5 Memory Allocated (%) 99.98500021922756
  • GPU 6 Memory Allocated (%) 78.07762809620475
  • GPU 2 Memory Allocated (%) 62.247705610456464
  • GPU 3 Memory Allocated (%) 55.404728700119
  • GPU 1 Memory Allocated (%) 54.42012770582584
  • GPU 0 Memory Allocated (%) 53.56598634327652
  • GPU 7 Memory Allocated (%) 53.100762373472996
  • GPU 4 Memory Allocated (%) 51.39001814588864
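The per-GPU percentages above look like the output of a small monitoring helper. A minimal sketch of how such numbers can be collected with stock PyTorch (this is an assumption about how they were gathered, not code from the thread):

```python
# Hedged sketch (not LLaMA-Factory code): print per-GPU allocated-memory
# percentages like the numbers pasted above, to spot the GPU whose usage
# spikes just before the OOM. Uses only standard torch.cuda APIs.
import torch


def log_gpu_memory() -> dict:
    """Return {device_index: allocated memory as % of total} and print it."""
    stats = {}
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        allocated = torch.cuda.memory_allocated(i)
        stats[i] = 100.0 * allocated / total
        print(f"GPU {i} Memory Allocated (%) {stats[i]:.2f}")
    return stats


if __name__ == "__main__":
    log_gpu_memory()
```

Calling this periodically from a training callback (e.g. every logging step) makes the imbalance visible well before the crash.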

Image

FloSophorae avatar Apr 27 '25 04:04 FloSophorae

Same problem. How did you all solve it? My version is 0.9.4.dev0.

edc3000 avatar Jul 28 '25 09:07 edc3000

same problem

guotong1988 avatar Jul 29 '25 06:07 guotong1988

@hiyouga could you take a look?

guotong1988 avatar Jul 29 '25 07:07 guotong1988