swift==3.10.1 causes an unexpected OOM during on-policy GKD
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots:
CUDA_VISIBLE_DEVICES=6 \
swift rollout \
--model /mnt/qwen/Qwen3-VL-8B-Instruct \
--vllm_max_model_len 24192
NPROC_PER_NODE=7 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 \
swift rlhf \
--rlhf_type gkd \
--model /mnt/qwen/Qwen3-VL-8B-Instruct \
--teacher_model /mnt/qwen/Qwen3-VL-235B-A22B-Instruct \
--train_type full \
--dataset /mnt/images/dataset.jsonl \
--seq_kd false \
--lmbda 1 \
--beta 1 \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--learning_rate 1e-5 \
--gradient_accumulation_steps 8 \
--save_steps 500 \
--save_total_limit 2 \
--logging_steps 1 \
--max_length 6000 \
--max_completion_length 3000 \
--warmup_ratio 0.05 \
--save_only_model true \
--dataloader_num_workers 64 \
--dataset_num_proc 4 \
--deepspeed zero2 \
--teacher_deepspeed zero3 \
--attn_impl flash_attn \
--use_vllm true \
--vllm_mode server \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
    --output_dir /home/output/
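To narrow down where the extra memory goes in 3.10.1, it may help to log per-GPU usage during the first training steps and compare it against an identical 3.10.0 run. A simple sketch (assuming nvidia-smi is available; the log file name is arbitrary):

# Log per-GPU memory every 5 seconds while the run warms up
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5 | tee gpu_mem_3.10.1.log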
Your hardware and system info
Write your system info here, e.g. CUDA version, OS, GPU model, and torch version.
Additional context
The issue is resolved after downgrading swift to version 3.10.0.
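For reference, a minimal way to pin the working version (assuming the package was installed from PyPI as ms-swift):

pip install 'ms-swift==3.10.0'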
try --deepspeed zero3_offload --teacher_deepspeed zero3_offload
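If the built-in presets are not flexible enough, a custom DeepSpeed config file can be used instead. The sketch below is a minimal ZeRO-3 CPU-offload config roughly equivalent to zero3_offload; the exact JSON behind swift's preset may differ, and passing a file path to --deepspeed / --teacher_deepspeed is an assumption here:

cat > zero3_offload_custom.json <<'EOF'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF
# Then point the trainer at the file (path-based configs are an assumption):
#   --deepspeed zero3_offload_custom.json --teacher_deepspeed zero3_offload_custom.json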
@hjh0119 I also ran into this problem.
--deepspeed zero2 --teacher_deepspeed zero3 works in 3.10.0 but OOMs in 3.10.1 for GKD training.
Did anything change between 3.10.0 and 3.10.1 that might cause this OOM issue?
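One way to check is to diff the two releases around the trainer code (the tag names and the swift/trainers path below are assumptions):

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
git diff v3.10.0 v3.10.1 -- swift/trainers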