[vineyard] vLLM KV cache is not stored in vineyard memory
🐛 Describe the bug
While testing model inference, the vLLM logs in the aibrix/vllm-openai container show that vineyard_llm_cache.py updated the KV cache data successfully. However, the KV cache was not written to vineyard memory, and vineyard's memory usage did not increase.
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=3952, #tokens=16, updated=16
(VllmWorkerProcess pid=375269) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=3952, #tokens=16, updated=16
(VllmWorkerProcess pid=375270) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=3952, #tokens=16, updated=16
(VllmWorkerProcess pid=375270) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
(VllmWorkerProcess pid=375268) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
(VllmWorkerProcess pid=375269) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375270) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375268) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375269) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375268) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
Even after I deleted the deepseek-coder-7b-kvcache Pod and the vineyard process, the vLLM logs still show that vineyard_llm_cache.py updated the KV cache successfully. No "Failed to connect to vineyard" exception is reported in the logs.
How can I get the KV cache stored in vineyard memory? And why do the vLLM logs still report that vineyard_llm_cache.py updated the KV cache successfully when the vineyard process no longer exists?
Steps to Reproduce
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-coder-7b-instruct
  labels:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-coder-7b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-coder-7b-instruct
    spec:
      containers:
        - name: vllm-openai
          image: aibrix/vllm-openai:v0.6.1-edb07092-20250118
          imagePullPolicy: Always
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - deepseek-ai/deepseek-coder-6.7b-instruct
            - --served-model-name
            - deepseek-coder-7b-instruct
            - --max-model-len
            - "8192" # please modify this field if your gpu has more room
            - --enable-prefix-caching
            - --disable-fastapi-docs
          env:
            - name: VLLM_USE_VINEYARD_CACHE
              value: "1"
            - name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
              value: "10"
            - name: AIBRIX_LLM_KV_CACHE
              value: "1"
            - name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
              value: "aibrix"
            - name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
              value: "16"
            - name: AIBRIX_LLM_KV_CACHE_SOCKET
              value: /var/run/vineyard.sock
            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "deepseek-coder-7b-kvcache-rpc:9600"
            - name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
              value: "1"
            - name: "VINEYARD_CACHE_METRICS_ENABLED"
              value: "1"
          volumeMounts:
            - mountPath: /var/run
              name: kvcache-socket
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
      volumes:
        - name: kvcache-socket
          hostPath:
            path: /var/run/vineyard-kubernetes/default/deepseek-coder-7b-kvcache
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-coder-7b-instruct # Note: The Service name must match the label value model.aibrix.ai/name in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
  type: ClusterIP
Expected behavior
no crash in the beginning and no restart should be required to run the engine successfully.
Environment
AIBrix version: 0.2.0
vLLM image version: aibrix/vllm-openai:v0.6.1-edb07092-20250118
@stellarzhou Thanks for trying out the vineyard-based KV cache. In this implementation there is a client-side cache inside vineyard's client (its capacity is also set by VINEYARD_CACHE_CPU_MEM_LIMIT_GB). It uses the S3FIFO eviction policy (see https://blog.jasony.me/system/cache/2023/08/01/s3fifo for details) to detect hot KV blocks. A KV block is stored into the vineyard server only if all of the following hold:
1. The KV block is marked as hot by the S3FIFO eviction policy.
2. The KV block is in S3FIFO's main FIFO (i.e., evictions must be triggered on the small FIFO so that hot KV blocks are promoted to the main FIFO).
3. The background loop (which stores hot KV blocks asynchronously) has run at least once after the KV block met conditions 1 and 2.
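To make these conditions concrete, here is a minimal toy sketch of an S3FIFO-style cache in Python. It is not vineyard's actual code: the class name `TinyS3Fifo`, the method `flush_hot_blocks` (a stand-in for the background loop), the 10% small-FIFO ratio, and the "at least one re-access means hot" threshold are all assumptions made for illustration.

```python
from collections import OrderedDict, deque


class TinyS3Fifo:
    """Toy S3FIFO-style cache. Hypothetical sketch, not vineyard's implementation."""

    def __init__(self, capacity, small_ratio=0.1, hot_threshold=1):
        # Assumed split: a small FIFO for new keys, a main FIFO for promoted keys.
        self.small_cap = max(1, int(capacity * small_ratio))
        self.main_cap = max(1, capacity - self.small_cap)
        self.hot_threshold = hot_threshold
        self.small = OrderedDict()  # key -> re-access count, oldest first
        self.main = OrderedDict()   # key -> re-access count, oldest first
        self.ghost = deque(maxlen=self.main_cap)  # keys recently dropped from small
        self.hot = set()            # keys marked hot (condition 1)
        self.stored = set()         # keys already written to the "server"

    def access(self, key):
        for queue in (self.small, self.main):
            if key in queue:
                queue[key] = min(queue[key] + 1, 3)
                if queue[key] >= self.hot_threshold:
                    self.hot.add(key)
                return
        if key in self.ghost:
            # Seen recently: admit directly into the main FIFO.
            self.ghost.remove(key)
            self._insert_main(key, 0)
            return
        if len(self.small) >= self.small_cap:
            self._evict_small()
        self.small[key] = 0

    def _evict_small(self):
        key, freq = self.small.popitem(last=False)
        if freq >= self.hot_threshold:
            self._insert_main(key, freq)  # promotion (condition 2)
        else:
            self.ghost.append(key)

    def _insert_main(self, key, freq):
        while len(self.main) >= self.main_cap:
            old, old_freq = self.main.popitem(last=False)
            if old_freq > 0:
                self.main[old] = old_freq - 1  # second chance at the head
            else:
                self.hot.discard(old)          # dropped from the cache
        self.main[key] = freq

    def flush_hot_blocks(self):
        """Stand-in for the background loop (condition 3)."""
        ready = [k for k in self.main if k in self.hot and k not in self.stored]
        self.stored.update(ready)
        return ready


cache = TinyS3Fifo(capacity=20)   # small FIFO holds 2 keys, main FIFO holds 18
cache.access("a")
cache.access("a")                 # "a" is now hot but still in the small FIFO
print(cache.flush_hot_blocks())   # [] : hot, but not yet in the main FIFO
cache.access("b")
cache.access("c")                 # small FIFO is full, so "a" is promoted
print(cache.flush_hot_blocks())   # ['a']
```

In this toy model, a light workload never fills the small FIFO, so nothing is promoted and the flush step has nothing to write, even though every update is reported as successful on the client side.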
Try running a large workload for a while; you should then observe some KV tensors on the server side. By the way, we will deprecate this vineyard-based implementation in the upcoming AIBrix v0.3.0 release and will release our new KV cache module. Please stay tuned; we look forward to you trying the new KV cache implementation.
@DwyaneShi Thanks for the explanation. Does the client-side cache inside vineyard's client also use the DRAM of the host machine it runs on?
Does garbage collection of hot KV blocks in the client-side cache use the same parameters (AIBrixCacheConfig) as the vineyard server?
VineyardLLMCache(tensor_nbytes=2048, cache_capacity=529237, layer=48, kv_cache_dtype=auto, torch_dtype=torch.bfloat16, cache=KVCache(cache_config=AIBrixCacheConfig(chunk_size=16, kv_cache_ns=default_1_0, local_sync_interval_s=180, enable_global_gc=True, global_gc_interval_s=600, global_ttl_s=480)
Is there a configuration parameter to turn off the client-side cache inside vineyard's client?