GRPO training of a 32B model OOMs
I'm using tensor parallel 8, optimizer offload, flash attention, and vLLM, and still hit OOM on a machine with 8×96G GPUs. The specific configuration and error are below:

nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --model xxxx/xxxxx \
    --model_type qwq \
    --attn_impl flash_attn \
    --gradient_checkpointing true \
    --reward_funcs reflection_q \
    --use_vllm false \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.8 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'xxxxx.jsonl' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 8
Error message:
You can try --deepspeed zero3_offload and --beta 0 to disable the ref model.
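In terms of the command above, that suggestion amounts to swapping two arguments. A minimal sketch (all other arguments stay as in the original script; the note about the reference model reflects the usual GRPO behaviour, where beta weights the KL penalty):

# Sketch only: switch to ZeRO-3 with CPU offload and set the KL weight to 0.
# With --beta 0 the KL penalty term drops out of the GRPO loss, so the
# reference model is never instantiated, at the cost of losing the KL constraint.
swift rlhf \
    --rlhf_type grpo \
    --deepspeed zero3_offload \
    --beta 0 \
    --train_type full
# ... keep all remaining arguments from the original command unchanged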
- I've tried offloading the parameters as well, but still got OOM.
- But doing so would remove the KL constraint. Does the ref model follow the tensor parallel = 8 setting?
- I tried LoRA, but still got OOM. Any other suggestions? Is there anything I can provide to help locate the error?
Decrease vllm_gpu_memory_utilization.

By the way, the following options are intended for the vLLM backend:

--sleep_level 1
--offload_model true
--offload_optimizer true
--gc_collect_after_offload true

Since you have set --use_vllm false, the above arguments will not take effect. Perhaps setting --use_vllm true will work.
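Applied to the original command, the vLLM-related portion would look roughly like the following. This is only a sketch, not a configuration verified in this thread, and 0.5 is an arbitrary lower starting value for vllm_gpu_memory_utilization:

# Sketch: enable the vLLM rollout backend and lower its memory fraction so the
# rollout engine leaves headroom for ZeRO-3 gathers during the training step.
swift rlhf \
    --rlhf_type grpo \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 8192 \
    --tensor_parallel_size 8 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3_offload \
    --train_type full
# ... plus the model/dataset/optimizer/logging arguments from the original script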
I ran into the same problem; like you, the OOM happens later in the vLLM inference service used for rollouts. I'm curious why the vLLM service here doesn't expose the tensor_parallel_size and pipeline_parallel_size arguments.
I met the same problem. Have you solved it yet?
The tensor parallelism for async mode and the 32B full GRPO training script are currently in development.
32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh
What if I don't have 8×80G GPUs (1 node, 8 GPUs per node), but instead 32×32G NPUs (4 nodes, 8 NPUs per node)? How should I rewrite the script to support multi-node training?
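Not an answer given in this thread, but as a sketch: the launcher environment generalizes to multiple nodes by running the same command once per node, with NNODES set to the node count and a per-node NODE_RANK; on Ascend NPUs the visible-device variable is typically ASCEND_RT_VISIBLE_DEVICES rather than CUDA_VISIBLE_DEVICES (an assumption, not stated here):

# Sketch for 4 nodes x 8 NPUs; run this on every node with RANK = 0..3,
# and point MASTER_ADDR/MASTER_PORT at the rank-0 node on all machines.
# With 32G per device, ZeRO-3 plus parameter/optimizer offload is likely required.
nnodes=4
nproc_per_node=8

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --deepspeed zero3_offload \
    --train_type full
# ... keep the remaining training arguments from the single-node script unchanged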
I want to know why decreasing vllm_gpu_memory_utilization works. Aren't sleep_level and the offload options supposed to free all of the GPU memory occupied by the vLLM backend? I assume memory usage peaks during offloading, but I'd like to understand the details. Thanks!
KeyError: 'rollout' when running Qwen2_5_32B_full.sh

Traceback (most recent call last):
  File "/miniconda3/envs/SWIFT/bin/swift", line 33, in <module>
    sys.exit(load_entry_point('ms-swift', 'console_scripts', 'swift')())
  File "/ms-swift-main/swift/cli/main.py", line 61, in cli_main
    file_path = importlib.util.find_spec(route_mapping[method_name]).origin
KeyError: 'rollout'
Same problem. I checked main.py and found that 'rollout' is missing from the route mapping there.
swift rollout requires ms-swift >= 3.5.
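A quick way to check the installed version and pick up the rollout entry point (the package name ms-swift is taken from the traceback above; the >= 3.5 requirement is the one stated in the previous reply):

# Show the currently installed ms-swift version
pip show ms-swift | grep -i version
# Upgrade to a release that registers the rollout subcommand
pip install -U "ms-swift>=3.5"
# After upgrading, this should print the rollout help instead of KeyError: 'rollout'
swift rollout --help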