
OOM when training a 32B model with GRPO

Open zhilinwang1 opened this issue 8 months ago • 11 comments

I'm using tensor parallel 8, optimizer offload, flash attention, and vLLM, and I still hit OOM on an 8*96G machine. The full configuration and the error are below:

nproc_per_node=8

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --model xxxx/xxxxx \
    --model_type qwq \
    --attn_impl flash_attn \
    --gradient_checkpointing true \
    --reward_funcs reflection_q \
    --use_vllm false \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.8 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'xxxxx.jsonl' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 8

Error message:

[screenshot of the OOM traceback]

zhilinwang1 avatar Apr 14 '25 07:04 zhilinwang1

you can try

  1. --deepspeed zero3_offload
  2. --beta 0 to disable the ref model (see the sketch below)
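
A minimal sketch of how those two flags would slot into the command above, everything else unchanged (zero3_offload additionally offloads the ZeRO-3 parameter/optimizer state to CPU, and beta 0 skips loading the reference model):

swift rlhf \
    --rlhf_type grpo \
    ...                           # keep the remaining flags from the original command
    --deepspeed zero3_offload \
    --beta 0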

hjh0119 avatar Apr 14 '25 08:04 hjh0119

you can try

  1. --deepspeed zero3_offload
  2. --beta 0 to disable ref model

  1. I've tried offloading the params as well, but I still get OOM.
  2. But doing so would remove the KL constraint. Does the ref model follow the tensor_parallel_size = 8 argument?

I also tried LoRA, but still got OOM. Any other suggestions? Is there anything I can provide that would help locate the error?

zhilinwang1 avatar Apr 14 '25 09:04 zhilinwang1

decrease vllm_gpu_memory_utilization

btw

--sleep_level 1
--offload_model true
--offload_optimizer true
--gc_collect_after_offload true

These options are intended for the vLLM backend. Since you have set --use_vllm false, the above arguments will not take effect. Perhaps setting --use_vllm true will work
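
i.e. turn colocated vLLM on and leave more headroom for training, along these lines (0.5 is only an example value; tune it for your cards):

--use_vllm true \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 8192 \
--sleep_level 1 \
--offload_model true \
--offload_optimizer true \
--gc_collect_after_offload true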

hjh0119 avatar Apr 14 '25 09:04 hjh0119

I ran into the same problem; like you, the OOM happens later, in the vLLM inference service that gets deployed. I'm curious why the vLLM service here doesn't expose the tensor_parallel_size and pipeline_parallel_size parameters.

AUFEfzx avatar Apr 19 '25 05:04 AUFEfzx

I've met the same problem. Have you solved it yet?

miyeeee avatar Apr 21 '25 03:04 miyeeee

The tensor parallelism for async mode and the 32B full GRPO training script are currently in development.

hjh0119 avatar Apr 21 '25 05:04 hjh0119

32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

hjh0119 avatar Apr 23 '25 06:04 hjh0119

32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

What if I don't have 8*80G GPUs (1 node, 8 GPUs per node) but instead have 32*32G NPUs (4 nodes, 8 NPUs per node)? How should I rewrite the script to support multi-node training?
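
Concretely, I assume only the node-count variables in the launch change and the rest of the script stays the same; a rough per-node sketch (not tested on NPUs, and the 32G cards may still need smaller max_length or more offloading):

NNODES=4 \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    ...                           # remaining flags as in the 32B example script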

miyeeee avatar Apr 24 '25 02:04 miyeeee

decrease vllm_gpu_memory_utilization

btw

--sleep_level 1
--offload_model true
--offload_optimizer true
--gc_collect_after_offload true

These options are intended for the vLLM backend. Since you have set --use_vllm false, the above arguments will not take effect. Perhaps setting --use_vllm true will work

I'd like to know why decreasing vllm_gpu_memory_utilization works. Isn't setting sleep_level and the offload options supposed to free all of the GPU memory occupied by the vLLM backend? I assumed training memory would peak during offloading, but I'd like to understand the details. Thanks!
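
Just to put rough numbers on it (illustrative arithmetic only, using the settings from the original post; I'm not sure exactly which allocations survive sleep_level 1):

# 96 GB cards, comparing --vllm_gpu_memory_utilization 0.8 with a lower value such as 0.5
python3 - <<'EOF'
total_gb = 96
for util in (0.8, 0.5):
    reserved = util * total_gb
    print(f"util={util}: vLLM reserves ~{reserved:.0f} GB, leaving ~{total_gb - reserved:.0f} GB for training peaks")
EOF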

heyubox avatar Apr 25 '25 15:04 heyubox

32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

KeyError: 'rollout' when running Qwen2_5_32B_full.sh

Traceback (most recent call last):
  File "/miniconda3/envs/SWIFT/bin/swift", line 33, in <module>
    sys.exit(load_entry_point('ms-swift', 'console_scripts', 'swift')())
  File "/ms-swift-main/swift/cli/main.py", line 61, in cli_main
    file_path = importlib.util.find_spec(route_mapping[method_name]).origin
KeyError: 'rollout'

GuliGuli-Boom avatar May 03 '25 10:05 GuliGuli-Boom

32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

KeyError: 'rollout' when running Qwen2_5_32B_full.sh

Traceback (most recent call last):
  File "/miniconda3/envs/SWIFT/bin/swift", line 33, in <module>
    sys.exit(load_entry_point('ms-swift', 'console_scripts', 'swift')())
  File "/ms-swift-main/swift/cli/main.py", line 61, in cli_main
    file_path = importlib.util.find_spec(route_mapping[method_name]).origin
KeyError: 'rollout'

Same problem. I checked main.py and 'rollout' isn't in the route mapping there.

ViktorJiangC avatar May 03 '25 12:05 ViktorJiangC

swift rollout requires ms-swift >= 3.5.
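
For example, assuming a pip-managed install:

pip show ms-swift                # check which version is currently installed
pip install -U 'ms-swift>=3.5'   # the rollout subcommand is only available from 3.5 on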

hjh0119 avatar Jun 26 '25 12:06 hjh0119