
[bug] GRPO timeout error in multi-node

Open JoyTim-777 opened this issue 9 months ago • 4 comments

Error info:

File "/verl/workers/fsdp_workers.py", line 390, in init_model self.rollout, self.rollout_sharding_manager = self._build_rollout() File "/verl/workers/fsdp_workers.py", line 325, in _build_rollout rollout = vLLMRollout(model_path=local_path, File "/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 100, in init self.inference_engine = LLM( File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1022, in inner return fn(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 242, in init self.llm_engine = self.engine_class.from_engine_args( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 489, in from_engine_args engine = cls( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 276, in init self._initialize_kv_caches() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 421, in _initialize_kv_caches self.model_executor.determine_num_available_blocks()) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 137, in determine_num_available_blocks dist.all_reduce(a_tensor, group=cpu_group, op=dist.ReduceOp.MIN) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2506, in all_reduce work.wait() RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

worker process:

Thread 461 (idle): "MainThread"
    load_tensor (torch/serialization.py:1772)
    persistent_load (torch/serialization.py:1812)
    load (torch/_weights_only_unpickler.py:385)
    _load (torch/serialization.py:1848)
    load (torch/serialization.py:1351)
    pt_weights_iterator (vllm/model_executor/model_loader/weight_utils.py:460)
    (vllm/model_executor/model_loader/loader.py:369)
    _get_all_weights (vllm/model_executor/model_loader/loader.py:385)
    (vllm/model_executor/models/utils.py:99)
    _groupby_prefix (vllm/model_executor/models/utils.py:101)
    _load_module (vllm/model_executor/models/utils.py:187)
    load_weights (vllm/model_executor/models/utils.py:235)
    load_weights (vllm/model_executor/models/qwen2.py:515)
    load_model (vllm/model_executor/model_loader/loader.py:409)
    get_model (vllm/model_executor/model_loader/__init__.py:14)
    load_model (vllm/worker/model_runner.py:1112)
    load_model (vllm/worker/worker.py:183)
    run_method (vllm/utils.py:2196)
    collective_rpc (vllm/executor/uniproc_executor.py:56)
    _init_executor (vllm/executor/uniproc_executor.py:120)
    __init__ (vllm/executor/executor_base.py:52)
    __init__ (vllm/engine/llm_engine.py:273)
    from_engine_args (vllm/engine/llm_engine.py:489)
    __init__ (vllm/entrypoints/llm.py:242)
    inner (vllm/utils.py:1022)
    __init__ (verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py:100)

model: qwen25-72b/qwen25-vl-72b, vllm: 0.7.3

It seems that the worker timed out while loading the vLLM model; on the backend we also observed that the worker process was hanging while loading model weights in vLLM.
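
For reference, the 1800000 ms in the RuntimeError is PyTorch's default 30-minute timeout for gloo process groups, so the CPU all_reduce in determine_num_available_blocks gives up while this rank is still loading weights. Below is a minimal standalone sketch, not verl's actual configuration, showing where that timeout lives in torch.distributed; the two-hour value and the single-process rendezvous are illustrative assumptions:

```python
# Standalone illustration only: the default gloo process-group timeout is
# datetime.timedelta(minutes=30), which is exactly the 1800000 ms reported
# in the RuntimeError above. Values below are illustrative.
import datetime
import os

import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes
)

# Collectives on this group now wait up to 2 hours instead of 30 minutes
# before raising "Timed out waiting ... for recv operation to complete".
t = torch.zeros(1)
dist.all_reduce(t, op=dist.ReduceOp.MIN)

dist.destroy_process_group()
```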

JoyTim-777 avatar Mar 05 '25 13:03 JoyTim-777

same error

yiyepiaoling0715 avatar Mar 06 '25 03:03 yiyepiaoling0715

@hiyouga @vermouth1992 Could you please take a look and help me resolve it?

JoyTim-777 avatar Mar 06 '25 08:03 JoyTim-777

Please see whether https://github.com/volcengine/verl/issues/491#issuecomment-2704116935 is the same issue causing the timeout error.

casper-hansen avatar Mar 06 '25 15:03 casper-hansen

It doesn't seem like it. The error in #491 was in trainer.init_workers(), but my error occurred while loading vLLM; the actor and ref models had already completed FSDP initialization.

JoyTim-777 avatar Mar 06 '25 15:03 JoyTim-777

Could you please confirm whether your version of verl is compatible with vLLM 0.7.3? When I tried installing it directly, I encountered the following error, so I would like to know which package versions are required.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. verl 0.2.0.dev0 requires vllm<=0.6.3, but you have vllm 0.7.3 which is incompatible.
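
As a side check, the constraint pip is warning about can be read back from the installed package metadata. A small sketch, assuming both verl and vllm are already pip-installed in the current environment:

```python
# Sketch: print the installed versions and verl's declared vLLM requirement,
# which is what triggers the "requires vllm<=0.6.3" resolver warning above.
from importlib.metadata import requires, version

print("verl:", version("verl"))
print("vllm:", version("vllm"))
vllm_constraints = [r for r in (requires("verl") or []) if r.lower().startswith("vllm")]
print("verl's declared vLLM constraint:", vllm_constraints)
```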

zheng-z18 avatar Mar 08 '25 01:03 zheng-z18

When training only language models such as Qwen2.5, vllm==0.6.3 is enough, but for Qwen2.5-VL, vllm>=0.7.2 is required.
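
To make that constraint concrete, here is a quick runtime check sketched under the versions stated above; the `using_vl_model` flag is a placeholder for illustration, not something verl reads:

```python
# Sketch: verify the installed vLLM against the versions discussed above
# (vllm==0.6.3 is enough for text-only Qwen2.5; Qwen2.5-VL needs vllm>=0.7.2).
from packaging.version import Version

import vllm

installed = Version(vllm.__version__)
using_vl_model = True  # placeholder: set True when training Qwen2.5-VL

if using_vl_model and installed < Version("0.7.2"):
    raise RuntimeError(f"Qwen2.5-VL needs vllm>=0.7.2, found {installed}")
print(f"vllm {installed} looks OK for this setup")
```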

JoyTim-777 avatar Mar 08 '25 02:03 JoyTim-777

When training only language models such as Qwen2.5, vllm==0.6.3 is enough, but for Qwen2.5-VL, vllm>=0.7.2 is required.

I'm still stuck with the environment installation. If possible, could you please provide a screenshot of your conda list so that I can compare? I would really appreciate it. Thank you very much!

zheng-z18 avatar Mar 08 '25 02:03 zheng-z18

I'm sorry, for some reason I can't provide it right now. I also encountered that error, but I think it can be ignored. You can see https://github.com/volcengine/verl/commit/b46f55ecc98e40fb36af465fdde9b7f7613e5e50; I installed verl based on this commit.

JoyTim-777 avatar Mar 08 '25 03:03 JoyTim-777