Problem description
When running GRPO training of QwQ on D910B NPU cards, training fails with a compute-stream synchronization timeout, even with `export HCCL_EXEC_TIMEOUT=3600` set.
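For reference, this is how the timeout was raised before launching training. The sketch below also sets `HCCL_CONNECT_TIMEOUT` (an Ascend HCCL variable controlling link-setup wait time); whether that second variable helps here is an assumption, since the reported timeout occurs during device synchronization, not connection setup.

```shell
# Raise Ascend HCCL timeouts before launching torchrun.
# HCCL_EXEC_TIMEOUT: seconds a rank waits for peers to reach a collective op.
# HCCL_CONNECT_TIMEOUT: seconds ranks wait while establishing HCCL links
# (assumption: may not affect this particular device-sync timeout).
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=1200
```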
Environment
Error message
Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/cli/rlhf.py", line 5, in <module>
rlhf_main()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
return SwiftRLHF(args).main()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/sft.py", line 31, in __init__
self._prepare_model_tokenizer()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/rlhf.py", line 65, in _prepare_model_tokenizer
super()._prepare_model_tokenizer()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/sft.py", line 62, in _prepare_model_tokenizer
self.model, self.processor = args.get_model_processor()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/argument/base_args/base_args.py", line 276, in get_model_processor
return get_model_tokenizer(**kwargs)
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 564, in get_model_tokenizer
model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 265, in get_model_tokenizer_with_flash_attn
return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 234, in get_model_tokenizer_from_local
model = automodel_class.from_pretrained(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/patcher.py", line 285, in _new_from_pretrained
return from_pretrained(cls, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
) = cls._load_pretrained_model(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4728, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 993, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 329, in set_module_tensor_to_device
new_value = value.to(device)
RuntimeError: ACL stream synchronize failed, error code:107020
[W compiler_depend.ts:465] Warning: NPU warning, error code is 107020[Error]: .
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 5139] 2025-04-18-15:50:02.969.138 wait for compute device to finish failed, runtime result = 107020.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)
Script:
torchrun --master_addr=${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=${NPROC_PER_NODE} --nnodes=${NNODES} --node_rank=${NODE_RANK} \
${SCRIPT_DIR}/swift/cli/rlhf.py \
--rlhf_type grpo \
--check_model false \
--model /cache/model \
--reward_funcs format \
--use_vllm false \
--vllm_device auto \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--vllm_gpu_memory_utilization 0.6 \
--train_type lora \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--target_modules all-linear \
--torch_dtype bfloat16 \
--dataset /cache/data \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-5 \
--gradient_accumulation_steps 1 \
--eval_strategy 'steps' \
--eval_steps 100 \
--save_strategy 'steps' \
--save_steps 100 \
--logging_steps 5 \
--max_length 2048 \
--output_dir /cache/output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--num_generations 8 \
--temperature 0.9 \
--system /cache/prompt.txt \
--log_completions true \
--num_iterations 1 \
--num_infer_workers 1 \
--async_generate false \
--beta 0.0 \
--max_grad_norm 0.5 \
--model_type qwen2_5 \
--tensor_parallel_size 8
One more question: does `--async_generate false` fully disable async mode?
For NPU, it is recommended to use an external vLLM server.
https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO.html#external
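Based on the linked GRPO docs, the external setup runs the rollout engine as a separate vLLM server process and points training at it. The sketch below is an assumption-laden outline, not a verified command line: the `swift rollout` subcommand and the `--vllm_mode server` / `--vllm_server_host` / `--vllm_server_port` flags exist in recent ms-swift releases but may differ or be absent in the version used here, and the host/port values are placeholders.

```shell
# Terminal 1 (assumption: recent ms-swift with the `rollout` subcommand):
# serve the policy model with vLLM on a dedicated node/card group.
swift rollout \
  --model /cache/model

# Terminal 2: point GRPO training at the external server instead of
# colocated vLLM (replace 127.0.0.1:8000 with the rollout server address).
swift rlhf \
  --rlhf_type grpo \
  --model /cache/model \
  --use_vllm true \
  --vllm_mode server \
  --vllm_server_host 127.0.0.1 \
  --vllm_server_port 8000
```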