[bug] GRPO timeout error in multi-node training
Error info:
File "/verl/workers/fsdp_workers.py", line 390, in init_model self.rollout, self.rollout_sharding_manager = self._build_rollout() File "/verl/workers/fsdp_workers.py", line 325, in _build_rollout rollout = vLLMRollout(model_path=local_path, File "/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 100, in init self.inference_engine = LLM( File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1022, in inner return fn(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 242, in init self.llm_engine = self.engine_class.from_engine_args( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 489, in from_engine_args engine = cls( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 276, in init self._initialize_kv_caches() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 421, in _initialize_kv_caches self.model_executor.determine_num_available_blocks()) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 137, in determine_num_available_blocks dist.all_reduce(a_tensor, group=cpu_group, op=dist.ReduceOp.MIN) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2506, in all_reduce work.wait() RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
worker process stack (py-spy dump):
Thread 461 (idle): "MainThread"
load_tensor (torch/serialization.py:1772)
persistent_load (torch/serialization.py:1812)
load (torch/_weights_only_unpickler.py:385)
_load (torch/serialization.py:1848)
load (torch/serialization.py:1351)
pt_weights_iterator (vllm/model_executor/model_loader/weight_utils.py:460)
model: Qwen2.5-72B / Qwen2.5-VL-72B, vLLM: 0.7.3
It seems that the worker timed out while loading the vLLM model; the backend also observed that the worker process was hanging while loading model weights in vLLM.
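For context, the 1800000 ms in the RuntimeError is torch.distributed's default 30-minute collective timeout: while one rank is still deserializing the 72B checkpoint, the other ranks sit in the all_reduce inside determine_num_available_blocks until Gloo gives up. Below is a minimal sketch (not verl's actual code) of one mitigation, passing a larger timeout when the process group is created; verl and vLLM create their groups internally, so in practice this would have to be applied at their init_process_group call sites, and the single-process setup here exists only so the snippet runs standalone.

```python
import datetime
import os

import torch.distributed as dist

# Single-process setup purely for illustration; in the failing run the
# group spans multiple nodes and is created inside verl/vLLM.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",  # the failing collective in the traceback runs on a gloo CPU group
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),  # default is timedelta(minutes=30)
)
dist.destroy_process_group()
```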
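Since the py-spy stack shows the worker inside pt_weights_iterator, i.e. torch.load over .bin/.pt shards, the complementary fix is to make weight loading fast enough that the timeout is never hit. A hedged sketch, assuming a standard Hugging Face checkpoint layout and placeholder paths, that re-saves the weights as safetensors so vLLM can use its faster safetensors loader:

```python
from transformers import AutoModelForCausalLM

# Placeholder paths; for Qwen2.5-VL the matching multimodal model class
# would be needed instead of AutoModelForCausalLM.
src = "/path/to/qwen25-72b"
dst = "/path/to/qwen25-72b-safetensors"

model = AutoModelForCausalLM.from_pretrained(
    src,
    torch_dtype="auto",      # keep the checkpoint's stored dtype
    low_cpu_mem_usage=True,  # avoid materializing a second full copy in RAM
)
model.save_pretrained(dst, safe_serialization=True)  # writes .safetensors shards
```

Pointing model_path at the re-saved directory should bypass the slow pickle-based iterator entirely; note the conversion itself needs enough CPU RAM to hold the 72B weights.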
same error
@hiyouga @vermouth1992 Could you please take a look and help me resolve it?
Could you check whether https://github.com/volcengine/verl/issues/491#issuecomment-2704116935 is the same issue causing the timeout error?
It doesn't seem like it: the error in #491 occurred in trainer.init_workers(), but mine occurred while loading vLLM; the actor and ref models had already completed FSDP initialization.
Could you please confirm whether your version of verl is compatible with vLLM 0.7.3? When I tried installing it directly, I encountered the following error. I would like to know which package versions are required.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
verl 0.2.0.dev0 requires vllm<=0.6.3, but you have vllm 0.7.3 which is incompatible.
When training only language models such as Qwen2.5, vllm==0.6.3 is sufficient, but for Qwen2.5-VL, vllm>=0.7.2 is required.
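To see exactly where pip's complaint comes from, here is a small diagnostic sketch (assuming both packages are already installed) that reads the installed vLLM version and verl's declared dependency metadata directly:

```python
from importlib.metadata import requires, version

print("vllm installed:", version("vllm"))  # e.g. 0.7.3
for req in requires("verl") or []:         # constraints declared in verl's metadata
    if req.startswith("vllm"):
        print("verl declares:", req)       # e.g. vllm<=0.6.3
```

The conflict is only pip's check against that declared pin; as the reply below notes, it could apparently be ignored on the commit they installed from.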
I'm still stuck on the environment installation. If possible, could you please provide a screenshot of your conda list so that I can compare? I would really appreciate it. Thank you very much!
I'm sorry, for some reason I can't provide it right now. I also encountered the error you mentioned, but I think it can be ignored. See https://github.com/volcengine/verl/commit/b46f55ecc98e40fb36af465fdde9b7f7613e5e50; I installed verl based on that commit.