vLLM init failed during GRPO training (c10d client socket timeout)
Has anyone run into the following error during GRPO training? vLLM 0.9.2 starts initializing the V1 engine for Qwen2.5-VL-7B-Instruct, waits for the init message from the front-end, and then the c10d client socket times out after 600000 ms (10 minutes). Any suggestions on how to fix this?
INFO 11-17 15:00:58 [init.py:244] Automatically detected platform cuda.
INFO 11-17 15:01:01 [core.py:526] Waiting for init message from front-end.
INFO 11-17 15:01:01 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='/root/work/filestorage/gaoshan/models/Qwen2_5-VL-7B-Instruct', speculative_config=None, tokenizer='/root/work/filestorage/gaoshan/models/Qwen2_5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/work/filestorage/gaoshan/models/Qwen2_5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
[E1117 15:10:51.175462273 socket.cpp:1019] [c10d] The client socket has timed out after 600000ms while trying to connect to (IP Address, Port).
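For context, the engine is created by the GRPO trainer rather than by my own code, but based on the logged engine config it should be roughly equivalent to the following direct vLLM initialization. This is only a sketch for reproduction (values copied from the log; it is not my actual training script):

```python
# Minimal sketch of the vLLM engine init implied by the log above.
# Values are taken from the logged engine config; in my setup the engine
# is launched by the GRPO trainer, not by this standalone script.
from vllm import LLM

llm = LLM(
    model="/root/work/filestorage/gaoshan/models/Qwen2_5-VL-7B-Instruct",
    dtype="bfloat16",
    max_model_len=16000,        # max_seq_len=16000 in the log
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    enforce_eager=True,
    enable_prefix_caching=True,
    seed=0,
)
```

Running the equivalent init standalone might help narrow down whether the timeout comes from vLLM itself or from the way the trainer launches it.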