Failed to connect to vineyard via both IPC and RPC connection
🐛 Describe the bug
INFO 02-17 17:03:44 model_runner.py:1041] Loading model weights took 12.5708 GB
INFO 02-17 17:03:44 vineyard_llm_cache.py:296] VineyardLLMCache async update: {'enable_async_update': True, 'min_inflight_tasks': 1, 'max_inflight_tasks': 8}
INFO 02-17 17:03:44 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
No RDMA endpoint provided. Fall back to TCP.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 10 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 9 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 8 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 7 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 6 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 5 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 4 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 3 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 2 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 1 more times.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
self._init_vineyard_cache(self.cache_service_metrics)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
return VineyardLLMCache(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
self.cache = VineyardKVCache(
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
cache_config = AIBrixCacheConfig(**config)
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 257, in __init__
self.rpc_client = vineyard.connect(
File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
return Client(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 296, in __init__
raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables `VINEYARD_IPC_SOCKET` and `VINEYARD_RPC_ENDPOINT`, as well as the configuration file, are all unavailable.
ERROR 02-17 17:03:58 api_server.py:188] RPCServer process died before responding to readiness probe
Steps to Reproduce
kubectl apply -f samples/kvcache/deployment.yaml
kubectl apply -f samples/kvcache/kvcache.yaml
Expected behavior
The inference engine should launch successfully.
Environment
- nightly version
I used the wrong endpoint here. But even with a wrong RPC endpoint, shouldn't the IPC connection still work?
- name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
value: "aibrix-mode-deepseek-coder-7b-kvcache-rpc:9600"
It should be deepseek-coder-7b-kvcache-rpc:9600
RPC is a must in our current implementation.
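For anyone hitting the same `getaddrinfo() failed` retries: the check below is a minimal sketch of how one might confirm, from inside the pod, that the configured RPC endpoint resolves before the engine starts. It only uses the Python standard library. The helper names `split_endpoint` and `endpoint_resolves` are made up for illustration and are not part of vineyard or AIBrix; the environment variable name is the one from the deployment above.

```python
import os
import socket
from typing import Tuple


def split_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split a 'host:port' string into (host, port). Hypothetical helper."""
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {endpoint!r}")
    return host, int(port)


def endpoint_resolves(endpoint: str) -> bool:
    """Return True if the host part of the endpoint resolves via getaddrinfo().

    This only checks name resolution (the step that failed in the log above);
    it does not prove that a server is listening on the port.
    """
    host, port = split_endpoint(endpoint)
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    # Variable name taken from the deployment YAML in this issue.
    endpoint = os.environ.get("AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT", "")
    if not endpoint:
        print("AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT is not set")
    elif endpoint_resolves(endpoint):
        print(f"{endpoint} resolves")
    else:
        print(f"{endpoint} does not resolve; check the Service name")
```

If the name does not resolve, the Service name in the deployment most likely does not match the Service actually created for the KV cache.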
Hmm, the log is somewhat misleading. It complains "Failed to connect to vineyard via both IPC and RPC connection". Technically, shouldn't it still be able to connect to the cache via IPC? Is it possible for the IPC connection to fail while RPC succeeds, so that all subsequent requests go through RPC? Do we have monitoring or logs to verify the data path?
I have a similar case. The pod's connection to the IPC socket failed.
INFO 03-24 01:06:39 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 10 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 9 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 8 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 7 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 6 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 5 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 4 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 3 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 2 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 1 more times.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
self._init_vineyard_cache(self.cache_service_metrics)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
return VineyardLLMCache(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
self.cache = VineyardKVCache(
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
cache_config = AIBrixCacheConfig(**config)
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 252, in __init__
self.ipc_client = vineyard.connect(socket).ipc_client
File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
return Client(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 283, in __init__
self._ipc_client = _connect(socket, **kwargs)
vineyard._C.ConnectionFailedException: Connection failed: Failed to connect to vineyardd: Failed to connect to IPC socket: /var/run/vineyard.sock
ERROR 03-24 01:06:52 api_server.py:188] RPCServer process died before responding to readiness probe
I changed default to my namespace in the path below, and the pod's IPC socket connection failure is now resolved.
volumes:
- name: kvcache-socket
hostPath:
path: /var/run/vineyard-kubernetes/default/deepseek-coder-7b-kvcache
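A quick way to verify the mount from inside the inference pod is to check that the socket path from the log actually exists and is a Unix socket. This is a minimal sketch using only the standard library; `is_unix_socket` is a hypothetical helper name, and `/var/run/vineyard.sock` is the path that appears in the error output above.

```python
import os
import stat


def is_unix_socket(path: str) -> bool:
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, permission problem, etc.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    # Path taken from the "Connection to IPC socket failed" log lines.
    socket_path = "/var/run/vineyard.sock"
    if is_unix_socket(socket_path):
        print(f"{socket_path} is present")
    else:
        print(f"{socket_path} is missing or not a socket; check the hostPath volume")
```

A missing socket here usually points at the volume definition (for example, a namespace segment in the hostPath that does not match) rather than at the vineyard client itself.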
@ying2025 The vineyard-based distributed KV cache in v0.2.0 has some limitations. It requires both the IPC and RPC servers to be ready.
Your issue above is similar to https://github.com/vllm-project/aibrix/issues/1012, which you just created. It is more related to scheduling: vineyard currently shares the same socket name and mounts it to a host path, so technically it only allows one instance per node. That problem is a little different from this one.
For this issue, the root cause was my carelessness: I used a mismatched RPC Service name. Users who deploy the KV cache server need to guarantee that the configuration is correct.
Both IPC and RPC are required for the connection. Since we have decided to deprecate the Vineyard cache server in v0.3.0, we won't spend more time improving this data path (either IPC or RPC). We can close this issue.
ok, thank you very much
Can you provide more examples of using the KV cache?