[Vineyard] vLLM engine crashes, failing to connect to vineyard when the pod starts.
🐛 Describe the bug
When using the KV cache, the vLLM engine pod always fails to start successfully on the first attempt; engine pods only reach the "Running" state after restarting once. The following log is from the pod as it was first created. It should reach "Running" on the first creation, not only after a restart. This happens on every deployment, which makes it critical.
kubectl logs llama-3-8b-instruct-5d74f6cc46-xgkmp
Defaulted container "vllm-openai" out of: vllm-openai, aibrix-runtime, init-model (init)
INFO 03-17 11:33:07 api_server.py:501] vLLM API server version 0.6.1
INFO 03-17 11:33:07 api_server.py:502] args: Namespace(host=None, port=8000, uvicorn_log_level='warning', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/models/llama-3.1-8b-instruct/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=15000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=0.0, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-8b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False)
INFO 03-17 11:33:07 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/287453f1-e030-4974-a2ee-3995f077896a for RPC Path.
INFO 03-17 11:33:08 api_server.py:178] Started engine process with PID 55
INFO 03-17 11:33:11 llm_engine.py:233] Initializing an LLM engine (v0.6.1) with config: model='/models/llama-3.1-8b-instruct/', speculative_config=None, tokenizer='/models/llama-3.1-8b-instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=15000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-8b-instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
INFO 03-17 11:33:12 model_runner.py:1030] Starting to load model /models/llama-3.1-8b-instruct/...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.51it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.45it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 2.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.76it/s]
INFO 03-17 11:33:15 model_runner.py:1041] Loading model weights took 14.9888 GB
INFO 03-17 11:33:15 vineyard_llm_cache.py:296] VineyardLLMCache async update: {'enable_async_update': True, 'min_inflight_tasks': 1, 'max_inflight_tasks': 32}
INFO 03-17 11:33:15 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 10 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 9 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 8 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 7 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 6 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 5 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 4 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 3 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 2 more times.
[info] Connection to IPC socket failed for pathname /var/run/vineyard.sock with ret = IOError: Cannot connect to /var/run/vineyard.sock: No such file or directory, retrying 1 more times.
No RDMA endpoint provided. Fall back to TCP.
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Terminating: Running: , Waiting:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
self._init_vineyard_cache(self.cache_service_metrics)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
return VineyardLLMCache(
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
self.cache = VineyardKVCache(
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
cache_config = AIBrixCacheConfig(**config)
File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 257, in __init__
self.rpc_client = vineyard.connect(
File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
return Client(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 296, in __init__
raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables `VINEYARD_IPC_SOCKET` and `VINEYARD_RPC_ENDPOINT`, as well as the configuration file, are all unavailable.
Terminating: Running: , Waiting:
ERROR 03-17 11:35:38 api_server.py:188] RPCServer process died before responding to readiness probe
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aibrix-kvcache-llama-3-8b-instruct-7ff8b98664-d4np4 1/1 Running 0 6m2s 10.0.1.33 10.0.1.28 <none> <none>
aibrix-kvcache-llama-3-8b-instruct-etcd-0 1/1 Running 0 6m2s 10.0.3.12 10.0.3.8 <none> <none>
llama-3-8b-instruct-5d74f6cc46-xgkmp 2/2 Running 1 (3m26s ago) 6m3s 10.0.1.35 10.0.1.28 <none> <none>
Steps to Reproduce
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-3-8b-instruct
labels:
model.aibrix.ai/name: llama-3-8b-instruct
model.aibrix.ai/port: "8000"
spec:
replicas: 1
strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 100%
type: RollingUpdate
selector:
matchLabels:
model.aibrix.ai/name: llama-3-8b-instruct
template:
metadata:
labels:
model.aibrix.ai/name: llama-3-8b-instruct
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: machine.cluster.vke.volcengine.com/gpu-name
operator: In
values:
- NVIDIA-L20
containers:
- command:
- python3
- -m
- vllm.entrypoints.openai.api_server
- --port
- "8000"
- --uvicorn-log-level
- warning
- --model
- /models/llama-3.1-8b-instruct/
- --served-model-name
- llama-3-8b-instruct
- --trust-remote-code
- --enable-chunked-prefill
- "false"
- --max-model-len
- "15000"
- --dtype
- bfloat16
- --disable-log-requests
- --swap-space
- "0"
# - --enable-prefix-caching
env:
- name: VLLM_USE_VINEYARD_CACHE
value: "0"
- name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
value: "72"
- name: AIBRIX_LLM_KV_CACHE
value: "0"
- name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
value: "aibrix"
- name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
value: "16"
- name: AIBRIX_LLM_KV_CACHE_SOCKET
value: /var/run/vineyard.sock
- name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
value: "aibrix-kvcache-llama-3-8b-instruct-rpc:9600"
- name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
value: "1"
- name: "VINEYARD_CACHE_METRICS_ENABLED"
value: "1"
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.1-edb07092-20250118
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
while true; do
RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
exit 0
else
echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
sleep 5
fi
done
livenessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8000
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: vllm-openai
ports:
- containerPort: 8000
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8000
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 1
nvidia.com/gpu: "1"
requests:
cpu: 1
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /models
name: model-hostpath
- mountPath: /dev/shm
name: dshm
- mountPath: /var/run
name: kvcache-socket
- command:
- aibrix_runtime
- --port
- "8080"
env:
- name: INFERENCE_ENGINE
value: vllm
- name: INFERENCE_ENGINE_ENDPOINT
value: http://localhost:8000
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 3
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 1
name: aibrix-runtime
ports:
- containerPort: 8080
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /ready
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
initContainers:
- command:
- aibrix_download
- --model-uri
- tos://aibrix-artifact-testing/models/llama-3.1-8b-instruct/
- --local-dir
- /models/
env:
- name: DOWNLOADER_MODEL_NAME
value: llama-3.1-8b-instruct
- name: DOWNLOADER_NUM_THREADS
value: "16"
- name: DOWNLOADER_ALLOW_FILE_SUFFIX
value: json, safetensors
- name: TOS_ACCESS_KEY
valueFrom:
secretKeyRef:
key: TOS_ACCESS_KEY
name: tos-credential
- name: TOS_SECRET_KEY
valueFrom:
secretKeyRef:
key: TOS_SECRET_KEY
name: tos-credential
- name: TOS_ENDPOINT
value: tos-cn-beijing.ivolces.com
- name: TOS_REGION
value: cn-beijing
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.0
imagePullPolicy: IfNotPresent
name: init-model
resources: {}
volumeMounts:
- mountPath: /models
name: model-hostpath
terminationGracePeriodSeconds: 60
volumes:
- name: model-hostpath
hostPath:
path: /root/models
type: DirectoryOrCreate
- name: dshm
emptyDir:
medium: Memory
sizeLimit: "4Gi"
- name: kvcache-socket
hostPath:
path: /var/run/vineyard-kubernetes/default/aibrix-kvcache-llama-3-8b-instruct
---
apiVersion: v1
kind: Service
metadata:
labels:
model.aibrix.ai/name: llama-3-8b-instruct
prometheus-discovery: "true"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
name: llama-3-8b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
namespace: default
spec:
ports:
- name: serve
port: 8000
protocol: TCP
targetPort: 8000
- name: http
port: 8080
protocol: TCP
targetPort: 8080
selector:
model.aibrix.ai/name: llama-3-8b-instruct
type: ClusterIP
---
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
name: aibrix-kvcache-llama-3-8b-instruct
namespace: default
annotations:
kvcache.orchestration.aibrix.ai/pod-affinity-workload: llama-3-8b-instruct
kvcache.orchestration.aibrix.ai/pod-anti-affinity: "true"
kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
cacheSpec:
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vineyardd:20241120
imagePullPolicy: IfNotPresent
cpu: "4000m"
memory: 72Gi
Expected behavior
The engine should start successfully on the first pod creation; no crash and no restart should be required.
Environment
main branch
@Jeffwan @DwyaneShi Could you check this issue? So far I have just been waiting for the engine pods to restart, but it should not have to work that way. This is pretty critical. Let me know if the issue is not clear.
It's more of an orchestration-related problem.
The kv cache pod is configured with an affinity to the engine pod, creating a startup dependency:
- the engine pod schedules and launches first
- the kv cache pod then schedules and launches afterward
This order creates a potential race condition during initialization (one possible workaround is sketched after the list):
- first launch: when the engine pod starts, the kvcache-socket mount point doesn't exist yet, causing mount failures, so the engine complains that it cannot find the socket file.
- subsequent restarts: if the engine pod restarts while the kv cache pod is already running, the mount point exists and the engine initializes successfully.
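One possible mitigation, sketched purely as an illustration (the container name and busybox image below are assumptions, not something AIBrix ships): add an extra init container to the engine Deployment, alongside the existing init-model container, that mounts the same kvcache-socket volume and blocks until /var/run/vineyard.sock appears, so vllm-openai only starts once the kv cache pod has created the socket. This assumes the kv cache pod can still be scheduled while the engine pod is sitting in its init phase, since inter-pod affinity only needs the engine pod to be scheduled, not running.

      initContainers:
      # Hypothetical init container (sketch only): gate vllm-openai on the vineyard socket.
      - name: wait-for-vineyard-socket
        image: busybox:1.36
        command:
        - /bin/sh
        - -c
        - |
          # Block until the kv cache pod has created the vineyard IPC socket
          # in the shared hostPath volume, then let the engine container start.
          until [ -S /var/run/vineyard.sock ]; do
            echo "waiting for /var/run/vineyard.sock"
            sleep 2
          done
        volumeMounts:
        - mountPath: /var/run
          name: kvcache-socket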
@DwyaneShi got it, makes sense. So if I start the kvcache pod first and then the engine pod, this should not happen, right?
Hmm, but that doesn't quite work either, since the kvcache pod needs to know where the engine pod is running for its affinity. That means the engine pod has to exist before the kvcache pod...
How can we fix or work around this? We can't just let the engine pods always restart by default.
@gangmuk Removing IPC is an option. Right now we don't actually rely heavily on the IPC connection; feel free to have a look at the codebase to see whether it's doable.
@DwyaneShi What do you mean by not relying heavily on IPC? Do you mean IPC is not needed? I thought the distributed kv cache uses IPC to fetch kv cache on the local node and RPC to fetch it from remote nodes. Is that correct?
And where should I look to disable the IPC connection check?