
swift 3.10.1 causes unexpected OOM during on-policy GKD

Open slbnuaa opened this issue 1 month ago • 1 comments

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

CUDA_VISIBLE_DEVICES=6 \
swift rollout \
    --model /mnt/qwen/Qwen3-VL-8B-Instruct \
    --vllm_max_model_len 24192


NPROC_PER_NODE=7 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 \
swift rlhf \
    --rlhf_type gkd \
    --model /mnt/qwen/Qwen3-VL-8B-Instruct \
    --teacher_model /mnt/qwen/Qwen3-VL-235B-A22B-Instruct \
    --train_type full \
    --dataset /mnt/images/dataset.jsonl \
    --seq_kd false \
    --lmbda 1 \
    --beta 1 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 8 \
    --save_steps 500 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 6000 \
    --max_completion_length 3000 \
    --warmup_ratio 0.05 \
    --save_only_model true \
    --dataloader_num_workers 64 \
    --dataset_num_proc 4 \
    --deepspeed zero2 \
    --teacher_deepspeed zero3 \
    --attn_impl flash_attn \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --output_dir /home/output/

Image

GPU memory fills up completely as soon as the checkpoint starts to load, and an OOM occurs every time loading reaches 31% progress.

Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version.

Image

Additional context
Add any other context about the problem here.

The issue is resolved by downgrading swift to 3.10.0.

slbnuaa, Nov 17 '25 11:11

Try --deepspeed zero3_offload --teacher_deepspeed zero3_offload
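
As rough context for the offload suggestion, the weight footprint alone can be estimated from the command in the report. This is a back-of-the-envelope sketch, not a measurement: the parameter counts (~235B teacher, ~8B student) are assumptions read off the model names, bfloat16 is taken as 2 bytes per parameter, and ZeRO-3 is assumed to shard the teacher evenly across the 7 training GPUs. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Rough estimate of bf16 weight memory only (no gradients, optimizer
# states, activations, or vLLM memory).
# Assumptions (not stated in the issue): parameter counts are taken from
# the model names and are approximate.
teacher_params_b=235   # billions of parameters, approximate
student_params_b=8     # billions of parameters, approximate
bytes_per_param=2      # bfloat16
num_gpus=7             # NPROC_PER_NODE in the training command

# Billions of params * bytes per param gives decimal GB directly.
teacher_total_gb=$(( teacher_params_b * bytes_per_param ))
# ZeRO-3 partitions parameters across ranks; integer division rounds down.
teacher_per_gpu_gb=$(( teacher_total_gb / num_gpus ))
# ZeRO-2 does not partition parameters, so each GPU holds a full student copy.
student_weights_gb=$(( student_params_b * bytes_per_param ))

echo "teacher weights: ${teacher_total_gb} GB total, ~${teacher_per_gpu_gb} GB per GPU"
echo "student weights: ${student_weights_gb} GB per GPU"
```

Under these assumptions the teacher shard and the student copy already account for roughly 67 + 16 GB per GPU before gradients, optimizer states, and activations are counted, which is why moving state to CPU with the offload configs may help. It does not explain why 3.10.0 fits and 3.10.1 does not.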

hjh0119, Nov 17 '25 11:11

@hjh0119 I am hitting this problem too.
--deepspeed zero2 --teacher_deepspeed zero3 works in 3.10.0 but runs out of memory in 3.10.1 for GKD training.

Did anything change between 3.10.0 and 3.10.1 that might cause this OOM issue?

yd-oom, Nov 27 '25 01:11