
[vineyard]vllm kv cache can not store into vineyard memory


🐛 Describe the bug

When testing model inference, the vLLM log in the aibrix/vllm-openai container printed that vineyard_llm_cache.py updated the kv cache data successfully, but the kv cache was not written into vineyard memory. Vineyard memory usage did not increase either.


DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=3952, #tokens=16, updated=16
(VllmWorkerProcess pid=375269) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=3952, #tokens=16, updated=16
(VllmWorkerProcess pid=375270) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=3952, #tokens=16, updated=16
(VllmWorkerProcess pid=375270) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
(VllmWorkerProcess pid=375268) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
(VllmWorkerProcess pid=375269) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375270) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375268) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375269) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:694] update kv cache: #prefix=4448, #tokens=16, updated=16
(VllmWorkerProcess pid=375268) DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}
DEBUG 05-07 02:31:46 vineyard_llm_cache.py:646] prefetch_kv_caches: matched={116: 495}

Even after deleting the deepseek-coder-7b-kvcache Pod and the vineyard process, the vLLM logs still show that vineyard_llm_cache.py has successfully updated the kv cache. There is no "Failed to connect to vineyard" exception reported in the logs.

How can I make the kv cache be stored in vineyard memory? And why do the vLLM logs still print that vineyard_llm_cache.py has successfully updated the kv cache, even though the vineyard process no longer exists?
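For reference, this is roughly how I checked vineyard server memory usage (a minimal sketch; it assumes the standard vineyard Python client is installed, uses the socket path from the deployment below, and attribute names may differ across vineyard versions):

    import vineyard

    # Connect via the same IPC socket that the vllm-openai container mounts.
    client = vineyard.connect('/var/run/vineyard.sock')

    # The instance status exposes server-side memory accounting; the usage
    # stays flat even while the "update kv cache" debug logs keep appearing.
    status = client.status
    print('memory usage:', status.memory_usage)
    print('memory limit:', status.memory_limit)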

Steps to Reproduce

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-coder-7b-instruct
  labels:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-coder-7b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-coder-7b-instruct
    spec:
      containers:
        - name: vllm-openai
          image: aibrix/vllm-openai:v0.6.1-edb07092-20250118
          imagePullPolicy: Always
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - deepseek-ai/deepseek-coder-6.7b-instruct
            - --served-model-name
            - deepseek-coder-7b-instruct
            - --max-model-len
            - "8192"  # please modify this field if your gpu has more room
            - --enable-prefix-caching
            - --disable-fastapi-docs
          env:
            - name: VLLM_USE_VINEYARD_CACHE
              value: "1"
            - name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
              value: "10"
            - name: AIBRIX_LLM_KV_CACHE
              value: "1"
            - name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
              value: "aibrix"
            - name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
              value: "16"
            - name: AIBRIX_LLM_KV_CACHE_SOCKET
              value: /var/run/vineyard.sock
            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "deepseek-coder-7b-kvcache-rpc:9600"
            - name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
              value: "1"
            - name: "VINEYARD_CACHE_METRICS_ENABLED"
              value: "1"
          volumeMounts:
            - mountPath: /var/run
              name: kvcache-socket
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
      volumes:
        - name: kvcache-socket
          hostPath:
            path: /var/run/vineyard-kubernetes/default/deepseek-coder-7b-kvcache


apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-coder-7b-instruct  # Note: The Service name must match the label value model.aibrix.ai/name in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
  type: ClusterIP
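After applying the manifests, I send completion requests against the service, for example (a representative request only; the exact prompt and parameters are illustrative, and it assumes the service is reachable locally, e.g. via kubectl port-forward svc/deepseek-coder-7b-instruct 8000:8000):

    import requests

    # Plain OpenAI-compatible completion request against the vLLM server.
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "deepseek-coder-7b-instruct",
            "prompt": "def quicksort(arr):",
            "max_tokens": 128,
        },
        timeout=60,
    )
    print(resp.json())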

Expected behavior

The kv cache should be stored into vineyard memory and vineyard memory usage should increase accordingly; no crash at startup and no restart should be required to run the engine successfully.

Environment

AIBrix version: 0.2.0
vLLM image version: aibrix/vllm-openai:v0.6.1-edb07092-20250118

stellarzhou avatar May 07 '25 14:05 stellarzhou

@stellarzhou Thanks for trying out the vineyard-based kv cache. In this vineyard-based impl, there is a client-side cache within vineyard's client (its capacity is also VINEYARD_CACHE_CPU_MEM_LIMIT_GB), which uses the S3FIFO eviction policy (see https://blog.jasony.me/system/cache/2023/08/01/s3fifo for more details on S3FIFO) to detect hot kv blocks. A kv block is selected to be stored into the vineyard server only if all of the following hold (a toy sketch of this flow follows the list below):

  1. the kv block is marked as hot by the S3FIFO eviction policy;
  2. the kv block is in S3FIFO's main FIFO (i.e., evictions on the small FIFO have to be triggered to promote hot kv blocks to the main FIFO);
  3. the background loop (which stores hot kv blocks asynchronously) has run at least once after the kv block met conditions 1 and 2.
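To make those conditions concrete, here is a toy, simplified sketch of the promotion flow (this is not the actual vineyard/AIBrix code; class and method names are made up for illustration):

    from collections import OrderedDict, deque

    class TinyS3FIFO:
        """Toy S3FIFO-like cache: blocks enter a small FIFO; blocks that are
        re-referenced before eviction get promoted to the main FIFO ("hot")."""

        def __init__(self, small_cap=4, main_cap=16):
            self.small = OrderedDict()     # block_id -> hit count since insertion
            self.main = OrderedDict()      # "hot" blocks, candidates for the server
            self.small_cap = small_cap
            self.main_cap = main_cap
            self.pending_upload = deque()  # hot blocks awaiting the background loop

        def access(self, block_id):
            if block_id in self.small:
                self.small[block_id] += 1      # condition 1: re-referenced, marked hot
                return
            if block_id in self.main:
                return
            self.small[block_id] = 0
            if len(self.small) > self.small_cap:
                evicted_id, hits = self.small.popitem(last=False)
                if hits > 0:
                    self.main[evicted_id] = hits          # condition 2: now in main FIFO
                    self.pending_upload.append(evicted_id)
                    if len(self.main) > self.main_cap:
                        self.main.popitem(last=False)
                # cold blocks are simply dropped

        def background_store_loop(self, store_fn):
            # condition 3: the async loop actually pushes hot blocks to the server
            while self.pending_upload:
                store_fn(self.pending_upload.popleft())

    cache = TinyS3FIFO()
    for bid in [1, 2, 1, 3, 4, 5, 6, 7]:   # block 1 is re-referenced, becomes "hot"
        cache.access(bid)
    cache.background_store_loop(lambda b: print("store block", b, "to vineyard server"))

The key point is that a block only reaches the server after it has been re-referenced, evicted from the small FIFO into the main FIFO, and the async loop has run.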

You could try a larger workload and run it for a while; then you should observe some kv tensors on the server side. By the way, we will deprecate this vineyard-based impl in the upcoming AIBrix v0.3.0 release and will release our new kv cache module. Please stay tuned; we would be glad to have you try the new kv cache impl.

DwyaneShi avatar May 07 '25 20:05 DwyaneShi

@DwyaneShi Thanks for the explanation. Does the client-side cache within vineyard's client also use DRAM on the host machine where it runs?

Do the GC parameters for hot kv blocks in the client-side cache use the same settings (AIBrixCacheConfig) as the vineyard server? For reference, the cache is initialized as:

VineyardLLMCache(tensor_nbytes=2048, cache_capacity=529237, layer=48, kv_cache_dtype=auto, torch_dtype=torch.bfloat16, cache=KVCache(cache_config=AIBrixCacheConfig(chunk_size=16, kv_cache_ns=default_1_0, local_sync_interval_s=180, enable_global_gc=True, global_gc_interval_s=600, global_ttl_s=480)

Are there any configuration parameters to turn off the client-side cache within vineyard's client?

stellarzhou avatar May 08 '25 06:05 stellarzhou