[Vineyard] vLLM engine crashes, failing to connect to vineyard when the pod starts.
🐛 Describe the bug
When using the KV cache, the vLLM engine pod always fails to start successfully on the first attempt; engine pods only reach the "Running" state after restarting once. The following log is from the pod as it was first created. It should reach "Running" on the first creation, not only after a restart. This happens on every deployment, which makes it critical.
kubectl logs llama-3-8b-instruct-5d74f6cc46-xgkmp
Defaulted container "vllm-openai" out of: vllm-openai, aibrix-runtime, init-model (init)
INFO 03-17 11:33:07 api_server.py:501] vLLM API server version 0.6.1
INFO 03-17 11:33:07 api_server.py:502] args: Namespace(host=None, port=8000, uvicorn_log_level='warning', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/models/llama-3.1-8b-instruct/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=15000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=0.0, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-8b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False)
INFO 03-17 11:33:07 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/287453f1-e030-4974-a2ee-3995f077896a for RPC Path.
INFO 03-17 11:33:08 api_server.py:178] Started engine process with PID 55
INFO 03-17 11:33:11 llm_engine.py:233] Initializing an LLM engine (v0.6.1) with config: model='/models/llama-3.1-8b-instruct/', speculative_config=None, tokenizer='/models/llama-3.1-8b-instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=15000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-8b-instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
INFO 03-17 11:33:12 model_runner.py:1030] Starting to load model /models/llama-3.1-8b-instruct/...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.51it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.45it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 2.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.76it/s]
INFO 03-17 11:33:15 model_runner.py:1041] Loading model weights took 14.9888 GB
INFO 03-17 11:33:15 vineyard_llm_cache.py:296] VineyardLLMCache async update: {'enable_async_update': True, 'min_inflight_tasks': 1, 'max_inflight_tasks': 32}
INFO 03-17 11:33:15 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 10 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 9 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 8 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 7 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 6 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 5 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 4 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 3 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 2 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 1 more times.
No RDMA endpoint provided. Fall back to TCP.
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
self._init_vineyard_cache(self.cache_service_metrics)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
return VineyardLLMCache(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
self.cache = VineyardKVCache(
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
cache_config = AIBrixCacheConfig(**config)
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 257, in __init__
self.rpc_client = vineyard.connect(
File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
return Client(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 296, in __init__
raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables `VINEYARD_IPC_SOCKET` and `VINEYARD_RPC_ENDPOINT`, as well as the configuration file, are all unavailable.
Terminating: Running: , Waiting:
ERROR 03-17 11:35:38 api_server.py:188] RPCServer process died before responding to readiness probe
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aibrix-kvcache-llama-3-8b-instruct-7ff8b98664-d4np4 1/1 Running 0 6m2s 10.0.1.33 10.0.1.28 <none> <none>
aibrix-kvcache-llama-3-8b-instruct-etcd-0 1/1 Running 0 6m2s 10.0.3.12 10.0.3.8 <none> <none>
llama-3-8b-instruct-5d74f6cc46-xgkmp 2/2 Running 1 (3m26s ago) 6m3s 10.0.1.35 10.0.1.28 <none> <none>
Steps to Reproduce
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-3-8b-instruct
labels:
model.aibrix.ai/name: llama-3-8b-instruct
model.aibrix.ai/port: "8000"
spec:
replicas: 1
strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 100%
type: RollingUpdate
selector:
matchLabels:
model.aibrix.ai/name: llama-3-8b-instruct
template:
metadata:
labels:
model.aibrix.ai/name: llama-3-8b-instruct
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: machine.cluster.vke.volcengine.com/gpu-name
operator: In
values:
- NVIDIA-L20
containers:
- command:
- python3
- -m
- vllm.entrypoints.openai.api_server
- --port
- "8000"
- --uvicorn-log-level
- warning
- --model
- /models/llama-3.1-8b-instruct/
- --served-model-name
- llama-3-8b-instruct
- --trust-remote-code
- --enable-chunked-prefill
- "false"
- --max-model-len
- "15000"
- --dtype
- bfloat16
- --disable-log-requests
- --swap-space
- "0"
# - --enable-prefix-caching
env:
- name: VLLM_USE_VINEYARD_CACHE
value: "0"
- name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
value: "72"
- name: AIBRIX_LLM_KV_CACHE
value: "0"
- name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
value: "aibrix"
- name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
value: "16"
- name: AIBRIX_LLM_KV_CACHE_SOCKET
value: /var/run/vineyard.sock
- name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
value: "aibrix-kvcache-llama-3-8b-instruct-rpc:9600"
- name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
value: "1"
- name: "VINEYARD_CACHE_METRICS_ENABLED"
value: "1"
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.1-edb07092-20250118
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
while true; do
RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
exit 0
else
echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
sleep 5
fi
done
livenessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8000
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: vllm-openai
ports:
- containerPort: 8000
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8000
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 1
nvidia.com/gpu: "1"
requests:
cpu: 1
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /models
name: model-hostpath
- mountPath: /dev/shm
name: dshm
- mountPath: /var/run
name: kvcache-socket
- command:
- aibrix_runtime
- --port
- "8080"
env:
- name: INFERENCE_ENGINE
value: vllm
- name: INFERENCE_ENGINE_ENDPOINT
value: http://localhost:8000
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 3
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 1
name: aibrix-runtime
ports:
- containerPort: 8080
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /ready
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
initContainers:
- command:
- aibrix_download
- --model-uri
- tos://aibrix-artifact-testing/models/llama-3.1-8b-instruct/
- --local-dir
- /models/
env:
- name: DOWNLOADER_MODEL_NAME
value: llama-3.1-8b-instruct
- name: DOWNLOADER_NUM_THREADS
value: "16"
- name: DOWNLOADER_ALLOW_FILE_SUFFIX
value: json, safetensors
- name: TOS_ACCESS_KEY
valueFrom:
secretKeyRef:
key: TOS_ACCESS_KEY
name: tos-credential
- name: TOS_SECRET_KEY
valueFrom:
secretKeyRef:
key: TOS_SECRET_KEY
name: tos-credential
- name: TOS_ENDPOINT
value: tos-cn-beijing.ivolces.com
- name: TOS_REGION
value: cn-beijing
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.0
imagePullPolicy: IfNotPresent
name: init-model
resources: {}
volumeMounts:
- mountPath: /models
name: model-hostpath
terminationGracePeriodSeconds: 60
volumes:
- name: model-hostpath
hostPath:
path: /root/models
type: DirectoryOrCreate
- name: dshm
emptyDir:
medium: Memory
sizeLimit: "4Gi"
- name: kvcache-socket
hostPath:
path: /var/run/vineyard-kubernetes/default/aibrix-kvcache-llama-3-8b-instruct
---
apiVersion: v1
kind: Service
metadata:
labels:
model.aibrix.ai/name: llama-3-8b-instruct
prometheus-discovery: "true"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
name: llama-3-8b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
namespace: default
spec:
ports:
- name: serve
port: 8000
protocol: TCP
targetPort: 8000
- name: http
port: 8080
protocol: TCP
targetPort: 8080
selector:
model.aibrix.ai/name: llama-3-8b-instruct
type: ClusterIP
---
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
name: aibrix-kvcache-llama-3-8b-instruct
namespace: default
annotations:
kvcache.orchestration.aibrix.ai/pod-affinity-workload: llama-3-8b-instruct
kvcache.orchestration.aibrix.ai/pod-anti-affinity: "true"
kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
cacheSpec:
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vineyardd:20241120
imagePullPolicy: IfNotPresent
cpu: "4000m"
memory: 72Gi
Expected behavior
The engine should start successfully on the first pod creation; no crash and no restart should be required.
Environment
main branch
@Jeffwan @DwyaneShi Could you check this issue? So far I have just been waiting for the engine pods to restart, but it should not have to work that way. This is pretty critical. Let me know if the issue is not clear.
It's more of an orchestration-related problem.
The kv cache pod is configured with an affinity to the engine pod, creating a startup dependency:
- the engine pod schedules and launches first
- the kv cache pod then schedules and launches afterward
This order creates a potential race condition during initialization (one possible workaround is sketched after the list):
- first launch: when the engine pod starts, the kvcache-socket mount point doesn't exist yet, causing mount failures, so the engine complains that it cannot find the socket file.
- subsequent restarts: if the engine pod restarts while the kv cache pod is already running, the mount point exists and the engine initializes successfully.
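One possible mitigation, sketched purely as an illustration (the container name and busybox image below are assumptions, not something AIBrix ships): add an extra init container to the engine Deployment, alongside the existing init-model container, that mounts the same kvcache-socket volume and blocks until /var/run/vineyard.sock appears, so vllm-openai only starts once the kv cache pod has created the socket. This assumes the kv cache pod can still be scheduled while the engine pod is sitting in its init phase, since inter-pod affinity only needs the engine pod to be scheduled, not running.

      initContainers:
      # Hypothetical init container (sketch only): gate vllm-openai on the vineyard socket.
      - name: wait-for-vineyard-socket
        image: busybox:1.36
        command:
        - /bin/sh
        - -c
        - |
          # Block until the kv cache pod has created the vineyard IPC socket
          # in the shared hostPath volume, then let the engine container start.
          until [ -S /var/run/vineyard.sock ]; do
            echo "waiting for /var/run/vineyard.sock"
            sleep 2
          done
        volumeMounts:
        - mountPath: /var/run
          name: kvcache-socket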
@DwyaneShi got it, makes sense. So if I start the kvcache pod first and then the engine pod, this should not happen, right?
Hmm, but that doesn't quite work either, since the kvcache pod needs to know where the engine pod is running for its affinity. That means the engine pod has to exist before the kvcache pod...
How can we fix or work around this? We can't just let the engine pods always restart by default.
@gangmuk Removing IPC is an option. Right now we don't actually rely heavily on the IPC connection; feel free to have a look at the codebase to see whether it's doable.
@DwyaneShi What do you mean by not relying heavily on IPC? Do you mean IPC is not needed? I thought the distributed kv cache uses IPC to fetch kv cache on the local node and RPC to fetch it from remote nodes. Is that correct?
And where should I look to disable the IPC connection check?