
Uneven GPU memory distribution during DPO

Open Arcmoon-Hu opened this issue 11 months ago • 1 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.48.3
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800 80GB PCIe
  • DeepSpeed version: 0.15.1
  • vLLM version: 0.8.2

Reproduction

14B model, single node with 8× A800, full-parameter fine-tuning, batch size can only be set to 1, using ds_z3_offload. Problems encountered so far:
1. Training crashes with OOM after a certain number of steps.
2. However, for SFT with ds_z3, batch_size can be set as high as 4.
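For context, a minimal sketch of a ds_z3_offload-style DeepSpeed config. The keys under `zero_optimization` are real DeepSpeed options; the specific values below are illustrative assumptions, not the config actually used in this issue:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Lowering `stage3_max_live_parameters`, `stage3_prefetch_bucket_size`, and `reduce_bucket_size` trades some throughput for a lower, more predictable per-GPU memory peak, which is the usual first knob to try for late-step OOMs under ZeRO-3.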

Image

Others

No response

Arcmoon-Hu avatar Apr 22 '25 12:04 Arcmoon-Hu

Same problem here. LoRA SFT on Qwen2.5VL-7B-Instruct suddenly goes out of memory after a certain number of steps; batch size and gradient_accumulation_steps can both only be set to 1 on 8× A800 80GB. During training, one GPU's memory usage suddenly spikes and it OOMs:

  • GPU 5 Memory Allocated (%) 99.98500021922756
  • GPU 6 Memory Allocated (%) 78.07762809620475
  • GPU 2 Memory Allocated (%) 62.247705610456464
  • GPU 3 Memory Allocated (%) 55.404728700119
  • GPU 1 Memory Allocated (%) 54.42012770582584
  • GPU 0 Memory Allocated (%) 53.56598634327652
  • GPU 7 Memory Allocated (%) 53.100762373472996
  • GPU 4 Memory Allocated (%) 51.39001814588864
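The per-GPU percentages above look like the output of a small monitoring helper. A minimal sketch of how such numbers can be collected with stock PyTorch (this is an assumption about how they were gathered, not code from the thread):

```python
# Hedged sketch (not LLaMA-Factory code): print per-GPU allocated-memory
# percentages like the numbers pasted above, to spot the GPU whose usage
# spikes just before the OOM. Uses only standard torch.cuda APIs.
import torch


def log_gpu_memory() -> dict:
    """Return {device_index: allocated memory as % of total} and print it."""
    stats = {}
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        allocated = torch.cuda.memory_allocated(i)
        stats[i] = 100.0 * allocated / total
        print(f"GPU {i} Memory Allocated (%) {stats[i]:.2f}")
    return stats


if __name__ == "__main__":
    log_gpu_memory()
```

Calling this periodically from a training callback (e.g. every logging step) makes the imbalance visible well before the crash.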

Image

FloSophorae avatar Apr 27 '25 04:04 FloSophorae

Same problem. How did you all solve it? My version is 0.9.4.dev0.

edc3000 avatar Jul 28 '25 09:07 edc3000

same problem

guotong1988 avatar Jul 29 '25 06:07 guotong1988

@hiyouga could you take a look?

guotong1988 avatar Jul 29 '25 07:07 guotong1988