miyeeee
### Script:

```shell
torchrun --master_addr=${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=${NPROC_PER_NODE} --nnodes=${NNODES} --node_rank=${NODE_RANK} \
    ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    --check_model false \
    --model /cache/model \
    --reward_funcs format \
    --use_vllm false \
    --vllm_device auto \
    ...
```
Met the same problem, have you solved it yet?
> 32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh What if I don't have 8\*80G GPUs (1 node, 8 GPUs per node), but instead have 32\*32G NPUs (4 nodes, 8 NPU per...
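For the multi-node case, the standard torchrun pattern is to run the same launch command on every node, changing only `NODE_RANK`. A minimal sketch, assuming 4 nodes with 8 devices each and that node 0 acts as the rendezvous master (the actual training flags, memory tuning for 32G devices, and NPU-specific setup are not covered here):

```shell
# Run this on EVERY node; only NODE_RANK differs per node (0, 1, 2, 3).
export MASTER_ADDR=<node0-ip>      # IP of the rank-0 node (assumption: reachable from all nodes)
export MASTER_PORT=29500           # any free port, identical on all nodes
export NNODES=4                    # total number of nodes
export NPROC_PER_NODE=8            # devices per node
export NODE_RANK=0                 # set to 0..3 depending on which node this is

torchrun --master_addr=${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=${NPROC_PER_NODE} --nnodes=${NNODES} --node_rank=${NODE_RANK} \
    ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    ...   # remaining flags as in the single-node script
```

With smaller 32G devices, the per-device batch size and sequence length from the 80G script will likely need to be reduced, or offloading/ZeRO sharding increased, but that depends on the specific model and swift configuration.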