GRPO training of a 32B model OOMs
I'm using tensor parallel 8, optimizer offload, flash attention, and vLLM, and still hit OOM on a machine with 8×96G GPUs. The specific configuration and error are below:

nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --model xxxx/xxxxx \
    --model_type qwq \
    --attn_impl flash_attn \
    --gradient_checkpointing true \
    --reward_funcs reflection_q \
    --use_vllm false \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.8 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'xxxxx.jsonl' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 8
Error message:
You can try --deepspeed zero3_offload and --beta 0 to disable the ref model.
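In terms of the command above, that suggestion amounts to swapping two arguments. A minimal sketch (all other arguments stay as in the original script; the note about the reference model reflects the usual GRPO behaviour, where beta weights the KL penalty):

# Sketch only: switch to ZeRO-3 with CPU offload and set the KL weight to 0.
# With --beta 0 the KL penalty term drops out of the GRPO loss, so the
# reference model is never instantiated, at the cost of losing the KL constraint.
swift rlhf \
    --rlhf_type grpo \
    --deepspeed zero3_offload \
    --beta 0 \
    --train_type full
# ... keep all remaining arguments from the original command unchanged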
- I've tried offloading the parameters as well, but still got OOM.
- But doing so would remove the KL constraint. Does the ref model follow the tensor parallel = 8 setting?
- I tried LoRA, but still got OOM. Any other suggestions? Is there anything I can provide to help locate the error?
Decrease vllm_gpu_memory_utilization.

By the way, the following options are intended for the vLLM backend:

--sleep_level 1
--offload_model true
--offload_optimizer true
--gc_collect_after_offload true

Since you have set --use_vllm false, the above arguments will not take effect. Perhaps setting --use_vllm true will work.
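Applied to the original command, the vLLM-related portion would look roughly like the following. This is only a sketch, not a configuration verified in this thread, and 0.5 is an arbitrary lower starting value for vllm_gpu_memory_utilization:

# Sketch: enable the vLLM rollout backend and lower its memory fraction so the
# rollout engine leaves headroom for ZeRO-3 gathers during the training step.
swift rlhf \
    --rlhf_type grpo \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 8192 \
    --tensor_parallel_size 8 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3_offload \
    --train_type full
# ... plus the model/dataset/optimizer/logging arguments from the original script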
I ran into the same problem; like you, the OOM happens later in the vLLM inference service used for rollouts. I'm curious why the vLLM service here doesn't expose the tensor_parallel_size and pipeline_parallel_size arguments.
I met the same problem. Have you solved it yet?
The tensor parallelism for async mode and the 32B full GRPO training script are currently in development.
32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh
What if I don't have 8×80G GPUs (1 node, 8 GPUs per node), but instead 32×32G NPUs (4 nodes, 8 NPUs per node)? How should I rewrite the script to support multi-node training?
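Not an answer given in this thread, but as a sketch: the launcher environment generalizes to multiple nodes by running the same command once per node, with NNODES set to the node count and a per-node NODE_RANK; on Ascend NPUs the visible-device variable is typically ASCEND_RT_VISIBLE_DEVICES rather than CUDA_VISIBLE_DEVICES (an assumption, not stated here):

# Sketch for 4 nodes x 8 NPUs; run this on every node with RANK = 0..3,
# and point MASTER_ADDR/MASTER_PORT at the rank-0 node on all machines.
# With 32G per device, ZeRO-3 plus parameter/optimizer offload is likely required.
nnodes=4
nproc_per_node=8

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --deepspeed zero3_offload \
    --train_type full
# ... keep the remaining training arguments from the single-node script unchanged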
I want to know why decreasing vllm_gpu_memory_utilization works. Aren't sleep_level and the offload options supposed to free all of the GPU memory occupied by the vLLM backend? I assume memory usage peaks during offloading, but I'd like to understand the details. Thanks!
KeyError: 'rollout' when running Qwen2_5_32B_full.sh

Traceback (most recent call last):
  File "/miniconda3/envs/SWIFT/bin/swift", line 33, in <module>
    sys.exit(load_entry_point('ms-swift', 'console_scripts', 'swift')())
  File "/ms-swift-main/swift/cli/main.py", line 61, in cli_main
    file_path = importlib.util.find_spec(route_mapping[method_name]).origin
KeyError: 'rollout'
Same problem. I checked main.py and found that 'rollout' is missing from the route mapping there.
swift rollout requires ms-swift >= 3.5.
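A quick way to check the installed version and pick up the rollout entry point (the package name ms-swift is taken from the traceback above; the >= 3.5 requirement is the one stated in the previous reply):

# Show the currently installed ms-swift version
pip show ms-swift | grep -i version
# Upgrade to a release that registers the rollout subcommand
pip install -U "ms-swift>=3.5"
# After upgrading, this should print the rollout help instead of KeyError: 'rollout'
swift rollout --help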