Failed to connect to vineyard via both IPC and RPC connection

Open Jeffwan opened this issue 10 months ago • 4 comments

🐛 Describe the bug


INFO 02-17 17:03:44 model_runner.py:1041] Loading model weights took 12.5708 GB
INFO 02-17 17:03:44 vineyard_llm_cache.py:296] VineyardLLMCache async update: {'enable_async_update': True, 'min_inflight_tasks': 1, 'max_inflight_tasks': 8}
INFO 02-17 17:03:44 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
No RDMA endpoint provided. Fall back to TCP.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 10 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 9 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 8 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 7 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 6 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 5 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 4 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 3 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 2 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 1 more times.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
    self._init_vineyard_cache(self.cache_service_metrics)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
    self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
    return VineyardLLMCache(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
    self.cache = VineyardKVCache(
  File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
    cache_config = AIBrixCacheConfig(**config)
  File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 257, in __init__
    self.rpc_client = vineyard.connect(
  File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 296, in __init__
    raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables `VINEYARD_IPC_SOCKET` and `VINEYARD_RPC_ENDPOINT`, as well as the configuration file, are all unavailable.
ERROR 02-17 17:03:58 api_server.py:188] RPCServer process died before responding to readiness probe

Steps to Reproduce

kubectl apply -f samples/kvcache/deployment.yaml
kubectl apply -f samples/kvcache/kvcache.yaml

Expected behavior

The inference engine should launch successfully.

Environment

  • nightly version

Jeffwan avatar Feb 18 '25 01:02 Jeffwan

I used the wrong endpoint here. But even with the wrong one, doesn't the IPC connection help?

            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "aibrix-mode-deepseek-coder-7b-kvcache-rpc:9600"

It should be deepseek-coder-7b-kvcache-rpc:9600
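The `getaddrinfo() failed` lines in the log mean the Service name in the endpoint never resolved from inside the pod. A minimal sketch of a pre-launch check, assuming the endpoint has the `host:port` form shown above. `split_endpoint` and `endpoint_resolves` are hypothetical helper names for illustration, not part of vLLM, AIBrix, or vineyard:

```python
import socket


def split_endpoint(endpoint: str) -> tuple[str, int]:
    """Split a 'host:port' string into its host and integer port."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {endpoint!r}")
    return host, int(port)


def endpoint_resolves(endpoint: str) -> bool:
    """Return True if the host part resolves, False on a getaddrinfo failure."""
    host, port = split_endpoint(endpoint)
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Same class of failure as the "getaddrinfo() failed" log lines.
        return False
    return True
```

Run from inside the inference pod, `endpoint_resolves("deepseek-coder-7b-kvcache-rpc:9600")` would only tell you whether the name resolves, not whether the cache server behind it is healthy.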

Jeffwan avatar Feb 18 '25 06:02 Jeffwan

RPC is a must in our current implementation.

DwyaneShi avatar Feb 18 '25 06:02 DwyaneShi

Hmm, the log is somewhat misleading. It complains "Failed to connect to vineyard via both IPC and RPC connection". Technically, shouldn't it be able to connect to the cache via IPC? Is it possible that it fails to connect via IPC, connects only via RPC, and then sends all subsequent requests via RPC? Do we have monitoring or logs to verify the data path?
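One way to tell which transport is actually failing is to probe each one separately. This is only a sketch under assumptions: it checks reachability at the socket level, does not speak the vineyard protocol, and cannot show which path later requests take. `probe_ipc` and `probe_rpc` are hypothetical helpers, not vineyard APIs:

```python
import socket


def probe_ipc(path: str, timeout: float = 1.0) -> str:
    """Try to open the Unix domain socket at `path`; return 'ok' or the error text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except OSError as exc:
            return f"failed: {exc}"
    return "ok"


def probe_rpc(host: str, port: int, timeout: float = 1.0) -> str:
    """Try to open a TCP connection to host:port; return 'ok' or the error text."""
    try:
        # Name resolution errors are also OSError subclasses, so they land here too.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return f"failed: {exc}"
```

Calling `probe_ipc("/var/run/vineyard.sock")` and `probe_rpc("deepseek-coder-7b-kvcache-rpc", 9600)` from the inference container would report the two transports independently, instead of the single combined error above.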

Jeffwan avatar Feb 18 '25 17:02 Jeffwan

I have a similar case: the pod's connection to the IPC socket failed.

INFO 03-24 01:06:39 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 10 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 9 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 8 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 7 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 6 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 5 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 4 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 3 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 2 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 1 more times.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
    self._init_vineyard_cache(self.cache_service_metrics)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
    self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
    return VineyardLLMCache(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
    self.cache = VineyardKVCache(
  File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
    cache_config = AIBrixCacheConfig(**config)
  File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 252, in __init__
    self.ipc_client = vineyard.connect(socket).ipc_client
  File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 283, in __init__
    self._ipc_client = _connect(socket, **kwargs)
vineyard._C.ConnectionFailedException: Connection failed: Failed to connect to vineyardd: Failed to connect to IPC socket: /var/run/vineyard.sock
ERROR 03-24 01:06:52 api_server.py:188] RPCServer process died before responding to readiness probe

I changed "default" in the hostPath below to my namespace, and the "Connection to IPC socket failed" error was resolved.

volumes:
  - name: kvcache-socket
    hostPath:
      path: /var/run/vineyard-kubernetes/default/deepseek-coder-7b-kvcache
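The change described here amounts to replacing the `default` segment of the hostPath with the namespace the KV cache actually runs in. A small sketch of that path layout, inferred only from the single path quoted in this thread. `kvcache_host_path` is a hypothetical helper, and the `<root>/<namespace>/<name>` layout is an assumption:

```python
from pathlib import PurePosixPath

# Base directory taken from the hostPath quoted in this thread.
VINEYARD_HOST_ROOT = PurePosixPath("/var/run/vineyard-kubernetes")


def kvcache_host_path(namespace: str, kvcache_name: str) -> str:
    """Build <root>/<namespace>/<kvcache name>, the assumed hostPath layout."""
    if not namespace or "/" in namespace:
        raise ValueError(f"invalid namespace: {namespace!r}")
    return str(VINEYARD_HOST_ROOT / namespace / kvcache_name)
```

With `namespace="default"` this reproduces the path in the sample manifest; passing your own namespace gives the value to put in the volume definition.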

ying2025 avatar Mar 24 '25 08:03 ying2025

@ying2025 The v0.2.0 vineyard-based distributed KV cache has some limitations. It requires both the IPC and RPC servers to be ready.

Your issue above is similar to https://github.com/vllm-project/aibrix/issues/1012, which you just created. It is more related to scheduling: vineyard instances currently share the same socket name and mount it to a host path, so technically only one instance is allowed on each node. That problem is a little different from this one.

Jeffwan avatar Apr 28 '25 22:04 Jeffwan

For this issue, the root cause was my carelessness: I used an RPC Service name that did not match. Users who deploy the KV cache server need to guarantee the correctness of the configuration.

Both IPC and RPC are required for the connection. Since we decided to deprecate the Vineyard cache server in v0.3.0, we won't spend more time improving this data path (either IPC or RPC). We can close this issue.
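Since the fix is purely a configuration match, a quick check of the endpoint's host against the Service names that actually exist can catch this kind of typo before the engine starts. A minimal sketch, assuming you pass in the Service names yourself (for example, copied from `kubectl get svc`). `check_endpoint` is a hypothetical helper, not an AIBrix function:

```python
import difflib


def check_endpoint(endpoint: str, services: list[str]) -> str:
    """Return 'ok' if the endpoint's host is a known Service, else a hint message."""
    # Drop the ':port' suffix if present; otherwise treat the whole string as the host.
    host = endpoint.rpartition(":")[0] or endpoint
    if host in services:
        return "ok"
    # Suggest the most similar known Service name, if any is close enough.
    close = difflib.get_close_matches(host, services, n=1)
    hint = f"; did you mean {close[0]!r}?" if close else ""
    return f"no Service named {host!r}{hint}"
```

For the endpoint from this issue, `check_endpoint("aibrix-mode-deepseek-coder-7b-kvcache-rpc:9600", ["deepseek-coder-7b-kvcache-rpc"])` reports the unknown name and suggests the matching Service.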

Jeffwan avatar Apr 28 '25 22:04 Jeffwan

ok, thank you very much

ying2025 avatar Apr 29 '25 03:04 ying2025

Can you provide more examples about kv cache?

ying2025 avatar May 06 '25 07:05 ying2025