[bug] GRPO timeout error in multi-node training
Error info:
File "/verl/workers/fsdp_workers.py", line 390, in init_model self.rollout, self.rollout_sharding_manager = self._build_rollout() File "/verl/workers/fsdp_workers.py", line 325, in _build_rollout rollout = vLLMRollout(model_path=local_path, File "/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 100, in init self.inference_engine = LLM( File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1022, in inner return fn(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 242, in init self.llm_engine = self.engine_class.from_engine_args( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 489, in from_engine_args engine = cls( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 276, in init self._initialize_kv_caches() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 421, in _initialize_kv_caches self.model_executor.determine_num_available_blocks()) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 137, in determine_num_available_blocks dist.all_reduce(a_tensor, group=cpu_group, op=dist.ReduceOp.MIN) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2506, in all_reduce work.wait() RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
worker process stack (py-spy dump):
Thread 461 (idle): "MainThread"
load_tensor (torch/serialization.py:1772)
persistent_load (torch/serialization.py:1812)
load (torch/_weights_only_unpickler.py:385)
_load (torch/serialization.py:1848)
load (torch/serialization.py:1351)
pt_weights_iterator (vllm/model_executor/model_loader/weight_utils.py:460)
model: Qwen2.5-72B / Qwen2.5-VL-72B, vLLM: 0.7.3
It seems that the worker timed out while loading the vLLM model; the backend also observed that the worker process was hanging while loading model weights in vLLM.
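For context, the 1800000 ms in the RuntimeError is torch.distributed's default 30-minute collective timeout: while one rank is still deserializing the 72B checkpoint, the other ranks sit in the all_reduce inside determine_num_available_blocks until Gloo gives up. Below is a minimal sketch (not verl's actual code) of one mitigation, passing a larger timeout when the process group is created; verl and vLLM create their groups internally, so in practice this would have to be applied at their init_process_group call sites, and the single-process setup here exists only so the snippet runs standalone.

```python
import datetime
import os

import torch.distributed as dist

# Single-process setup purely for illustration; in the failing run the
# group spans multiple nodes and is created inside verl/vLLM.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",  # the failing collective in the traceback runs on a gloo CPU group
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),  # default is timedelta(minutes=30)
)
dist.destroy_process_group()
```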
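Since the py-spy stack shows the worker inside pt_weights_iterator, i.e. torch.load over .bin/.pt shards, the complementary fix is to make weight loading fast enough that the timeout is never hit. A hedged sketch, assuming a standard Hugging Face checkpoint layout and placeholder paths, that re-saves the weights as safetensors so vLLM can use its faster safetensors loader:

```python
from transformers import AutoModelForCausalLM

# Placeholder paths; for Qwen2.5-VL the matching multimodal model class
# would be needed instead of AutoModelForCausalLM.
src = "/path/to/qwen25-72b"
dst = "/path/to/qwen25-72b-safetensors"

model = AutoModelForCausalLM.from_pretrained(
    src,
    torch_dtype="auto",      # keep the checkpoint's stored dtype
    low_cpu_mem_usage=True,  # avoid materializing a second full copy in RAM
)
model.save_pretrained(dst, safe_serialization=True)  # writes .safetensors shards
```

Pointing model_path at the re-saved directory should bypass the slow pickle-based iterator entirely; note the conversion itself needs enough CPU RAM to hold the 72B weights.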
same error
@hiyouga @vermouth1992 Could you please take a look and help me resolve it?
Could you check whether https://github.com/volcengine/verl/issues/491#issuecomment-2704116935 is the same issue causing the timeout error?
It doesn't seem like it: the error in #491 occurred in trainer.init_workers(), but mine occurred while loading vLLM; the actor and ref models had already completed FSDP initialization.
Could you please confirm whether your version of verl is compatible with vLLM 0.7.3? When I tried installing it directly, I encountered the following error. I would like to know which package versions are required.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
verl 0.2.0.dev0 requires vllm<=0.6.3, but you have vllm 0.7.3 which is incompatible.
When training only language models such as Qwen2.5, vllm==0.6.3 is sufficient, but for Qwen2.5-VL, vllm>=0.7.2 is required.
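To see exactly where pip's complaint comes from, here is a small diagnostic sketch (assuming both packages are already installed) that reads the installed vLLM version and verl's declared dependency metadata directly:

```python
from importlib.metadata import requires, version

print("vllm installed:", version("vllm"))  # e.g. 0.7.3
for req in requires("verl") or []:         # constraints declared in verl's metadata
    if req.startswith("vllm"):
        print("verl declares:", req)       # e.g. vllm<=0.6.3
```

The conflict is only pip's check against that declared pin; as the reply below notes, it could apparently be ignored on the commit they installed from.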
I'm still stuck on the environment installation. If possible, could you please provide a screenshot of your conda list so that I can compare? I would really appreciate it. Thank you very much!
I'm sorry, for some reason I can't provide it right now. I also encountered the error you mentioned, but I think it can be ignored. See https://github.com/volcengine/verl/commit/b46f55ecc98e40fb36af465fdde9b7f7613e5e50; I installed verl based on that commit.