
Running with specified GPUs fails: using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.

Open xiaoAugenstern opened this issue 7 months ago • 13 comments

Error:

(WorkerDict pid=664594) Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 25.59it/s]
(WorkerDict pid=664594) [rank1]:[W510 09:18:00.213483298 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
(WorkerDict pid=664335) NCCL version 2.21.5+cuda12.4
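
For reference, the warning itself describes the fix it asks for: pin each rank to its own GPU before the first collective, either by passing device_id to init_process_group() or device_ids to barrier(). Below is a minimal, generic sketch of that pattern, not verl's actual code; it assumes LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are provided by the launcher.

import os
import torch
import torch.distributed as dist

# Pin this process to the GPU that belongs to its local rank.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# Option 1: tell the process group which device this rank owns up front,
# so later barriers do not have to guess (and fall back to GPU 0).
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)

# Option 2: pass the device explicitly to the barrier call itself.
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()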

My config script is:

set -x
ENGINE=${1:-vllm}
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
#export VLLM_ATTENTION_BACKEND=XFORMERS
export HYDRA_FULL_ERROR=1

export CUDA_VISIBLE_DEVICES=2,5,6,7
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env=verl/trainer/runtime_env.yaml \
    --no-wait \
    -- \
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/home/xiaoman/project/gec/Visual-RFT/cgec/llm/verl/data/geo3k/train.parquet \
    data.val_files=/home/xiaoman/project/gec/Visual-RFT/cgec/llm/verl/data/geo3k/test.parquet \
    data.train_batch_size=512 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.image_key=images \
    actor_rollout_ref.model.path=/home/LLMs/Qwen/Qwen2.5-VL-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=10 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=20 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=$ENGINE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=20 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_geo3k' \
    trainer.experiment_name='qwen2_5_vl_7b_function_rm' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.total_epochs=1 $@

The Linux commands are:

ray stop
export CUDA_VISIBLE_DEVICES=2,5,6,7
ray start --head --port=6379 --dashboard-host=0.0.0.0

import ray
ray.init(address="auto")
print(ray.available_resources())
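
As a quick sanity check (a sketch, not part of verl): Ray only sees the GPUs that were visible when ray start was executed, and it rewrites CUDA_VISIBLE_DEVICES inside every GPU task or actor to the device(s) it assigned. A tiny remote task can therefore show whether the workers really receive the intended GPUs:

import os
import ray

ray.init(address="auto")
# Expect roughly {'GPU': 4.0, ...} if only four devices were visible at ray start time.
print(ray.available_resources())

@ray.remote(num_gpus=1)
def visible_devices():
    # Ray rewrites CUDA_VISIBLE_DEVICES per task to the GPU(s) it assigned.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(visible_devices.remote()))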

1. The job hangs for about 30 minutes and then fails. Training never starts and the processes cannot communicate with each other. How can I fix this?
2. How do I make Ray use the specified GPUs?

Your work is really great. Could the authors please take a look and reply? Thank you!!

xiaoAugenstern avatar May 10 '25 13:05 xiaoAugenstern

Hi, has this been solved?

jnanliu avatar May 23 '25 06:05 jnanliu

Hi, has this been solved?

Not yet.

xiaoAugenstern avatar May 26 '25 06:05 xiaoAugenstern

Has this been solved?

yuanzhang0 avatar Jun 01 '25 10:06 yuanzhang0

Has anyone solved this?

ZhuHongZez avatar Jun 04 '25 14:06 ZhuHongZez

I also hit a similar error when using a 4-GPU container image on an 8-GPU node. With the full 8-GPU image it works fine.

ZhuHongZez avatar Jun 04 '25 14:06 ZhuHongZez

I also hit a similar error when using a 4-GPU container image on an 8-GPU node. With the full 8-GPU image it works fine.

Not solved; I still haven't managed to get it running.

xiaoAugenstern avatar Jun 05 '25 06:06 xiaoAugenstern

I also hit a similar error when using a 4-GPU container image on an 8-GPU node. With the full 8-GPU image it works fine.

Not solved; I still haven't managed to get it running.

It is really strange: with the 8-GPU image, even export CUDA_VISIBLE_DEVICES=7,2,4 works fine, but as soon as I switch to the 4-GPU image the error appears.

ZhuHongZez avatar Jun 05 '25 06:06 ZhuHongZez

(WorkerDict pid=324403) You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
(WorkerDict pid=324403) [rank1]:[W604 19:53:19.704527568 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
(raylet) The node with node id: a410d8e4e5d07705c7393e6aaa93aed119c021de58474b7c5a206bd1 and address: 172.22.62.4 and node name: 172.22.62.4 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
  File "/zhuhongze/envs/easyr1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/zhuhongze/envs/easyr1/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/zhuhongze/EasyR1/verl/trainer/main.py", line 132, in <module>
    main()
  File "/zhuhongze/EasyR1/verl/trainer/main.py", line 128, in main
    ray.get(runner.run.remote(ppo_config))
  File "/zhuhongze/envs/easyr1/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/zhuhongze/envs/easyr1/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/zhuhongze/envs/easyr1/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/zhuhongze/envs/easyr1/lib/python3.10/site-packages/ray/_private/worker.py", line 908, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: Runner
    actor_id: dfcf0d89fc2629413d4d9a7d01000000
    pid: 316010
    namespace: b8949820-57de-4860-902c-aa39ac470a46
    ip: 172.22.62.4
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 172.22.62.4 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
(raylet) [2025-06-04 19:53:22,585 E 312152 312277] (raylet) agent_manager.cc:83: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/424238335.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of grpcio doesn't follow Ray's requirement. Agent can segfault with the incorrect grpcio version. Check the grpcio version pip freeze | grep grpcio.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).

  • 2 examples/qwen2_5_vl_3b_geo3k_grpo.sh: line 24: 2: command not found

ZhuHongZez avatar Jun 05 '25 06:06 ZhuHongZez

(quoting the error log from ZhuHongZez's comment above)

Hi, is it solved?

xtu-xiaoc avatar Jun 25 '25 07:06 xtu-xiaoc

(quoting the error log from ZhuHongZez's comment above)

Hi, is it solved?

pip uninstall grpcio worked for me, although it is not a good way to fix it.
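
If you would rather check before uninstalling: the raylet message above blames a grpcio/Ray version mismatch, so it may be enough to print the installed versions and compare grpcio against the range your Ray release pins. A small generic check, nothing verl-specific:

from importlib.metadata import version

# Compare the grpcio version against the requirement of the installed Ray
# release (see Ray's own dependency pins); a mismatch can crash the dashboard agent.
print("ray   :", version("ray"))
print("grpcio:", version("grpcio"))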

ZhuHongZez avatar Jun 25 '25 07:06 ZhuHongZez

Has anyone solved this problem?

suyan-liang avatar Aug 21 '25 12:08 suyan-liang

Hi, is there any update on this issue? It has been bothering me for quite a few days.

kygguo avatar Oct 30 '25 14:10 kygguo

I also ran into this problem when using llama_factory, and I solved it with the method in this blog post; you can give it a try: https://blog.gitcode.com/a5917737eccf9478cff21eb4475b4269.html

WDYYY-XIXI avatar Nov 25 '25 06:11 WDYYY-XIXI