store/retrieve hidden states in PD Disagg
This PR allows the vLLM LMCache connector to store/retrieve hidden_states in PD disaggregation, so the first iteration on the consumer side does not need to recompute them.
End-to-end verified with Llama as well as DeepSeek.
@KuntaiDu @YaoJiayi
This commit transfers the hidden states from prefill to decode via LMCache (with remote KV cache storage), as shown in the following diagram:
+----------------+          +----------------+
|                |          |                |
|    Prefill     |          |     Decode     |
|                |          |                |
+----------------+          +----------------+
|                |          |                |
|    LMCache     |          |    LMCache     |
|                |          |                |
+------+---------+          +--+-------------+
       |                       |
       | put                   | get
       | KV Cache              | KV Cache
       | Hidden States         | Hidden States
       v                       v
+------+-----------------------+--------------------+
|                                                    |
|                                                    |
|              Remote KV Cache Storage               |
|                                                    |
|                                                    |
+----------------------------------------------------+
Without this commit, the last token is recomputed on the decode node; that recompute is prioritized by the vLLM scheduler and blocks the decode forward passes.
The recompute log on the decode worker looks like:
DEBUG LMCache: Injected token number: 2133 [2025-02-25 23:16:07,504] -- /data/rain_dev/LMCache/lmcache/integration/vllm/vllm_adapter.py:740
DEBUG LMCache: Rebuilt the input! [2025-02-25 23:16:07,508] -- /data/rain_dev/LMCache/lmcache/integration/vllm/vllm_adapter.py:768
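To make the intent concrete, here is a minimal, self-contained sketch of the put/get flow in the diagram above (illustrative Python only; RemoteKVStore, prefill_node, and decode_node are hypothetical stand-ins, not the actual LMCache or vLLM APIs):

# Illustrative sketch only: RemoteKVStore, prefill_node and decode_node are
# hypothetical stand-ins, not the actual LMCache or vLLM APIs.
import torch

class RemoteKVStore:
    """Stands in for the remote KV cache storage in the diagram above."""

    def __init__(self):
        self._store = {}

    def put(self, key, kv_cache, hidden_states):
        # Producer (prefill) side: store the KV cache and the hidden states together.
        self._store[key] = (kv_cache, hidden_states)

    def get(self, key):
        # Consumer (decode) side: retrieve both in one round trip.
        return self._store.get(key, (None, None))

def prefill_node(store, key, num_tokens, hidden_dim=4096):
    kv_cache = torch.zeros(2, num_tokens, 8, 64)         # illustrative shape only
    hidden_states = torch.zeros(num_tokens, hidden_dim)  # includes the last prefill token
    store.put(key, kv_cache, hidden_states)

def decode_node(store, key):
    kv_cache, hidden_states = store.get(key)
    if hidden_states is not None:
        # With this PR: the decode node can sample the first output token from the
        # stored hidden states; no last-token prefill is needed on the decode side.
        return hidden_states[-1]
    # Without this PR: the last token has to be recomputed here, and that prefill is
    # prioritized by the vLLM scheduler, blocking ongoing decode forward passes.
    raise RuntimeError("hidden states missing; last-token recompute required")

store = RemoteKVStore()
prefill_node(store, key="req-0", num_tokens=16)
print(decode_node(store, key="req-0").shape)  # torch.Size([4096])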
@chenqianfzh Thanks for your effort! Is the DeepSeek you mentioned R1? Could you provide an image (maybe pushed to Docker Hub) that can run vLLM and DeepSeek-V3? I would like to test it in our environment.
@chenqianfzh @rainj-me Just curious, how much overhead would it introduce if we do not save the KV cache but let the decoding instance decode 1 token?
The problem is that the last-token prefill will be scheduled by vLLM with higher priority and will block the decode forward sequence. This increases both TTFT and TPOT latency.
@YaoJiayi Thanks for your comments. Could you take another look, since I have updated the PR? Thanks.
Applications that use disaggregated prefill are typically latency-sensitive. Decoding one token basically introduces one inter-token latency, which is 10-20 ms. So I agree with @rainj-me that LMCache should handle hidden states.
@chenqianfzh Could you please give me a PD Disagg example that can run vLLM+LMCache?
With the PyNcclConnector, I have to specify arguments like the following:
# First node
python3 -m vllm.entrypoints.openai.api_server --dtype=half --model /root/.cache/huggingface \
--trust-remote-code --served-model-name Qwen2.5-1.5B-Instruct \
--port 8100 --max-model-len 10000 --gpu-memory-utilization 0.6 \
--kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9,"kv_ip":"192.168.1.243"}'
# Second node
python3 -m vllm.entrypoints.openai.api_server --dtype=half --model /root/.cache/huggingface \
--trust-remote-code --served-model-name Qwen2.5-1.5B-Instruct \
--port 8100 --max-model-len 10000 --gpu-memory-utilization 0.6 \
--kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9,"kv_ip":"192.168.1.243"}'
# Start a proxy server to route requests to P and D
https://github.com/vllm-project/vllm/blob/bc6ccb987877000ec271e0076317b03a66cde4bc/benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py
# Send request via http
curl http://localhost:8100/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "How to sleep fast?"}]
}'
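For anyone scripting the same request, the curl above can also be issued from Python. A minimal sketch using the requests library, with the endpoint and model name taken from the curl command:

import requests

resp = requests.post(
    "http://localhost:8100/v1/chat/completions",
    json={
        "model": "Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "How to sleep fast?"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])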
@rainj-me is preparing a blog about the details, and he said he has talked to you offline about it.
However, I can share the setup I used in my tests:
On the producer side:
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=/data/example.yaml vllm serve /data/models/Llama-3.1-8B-Instruct --port 7080 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer"}'
On the consumer side (it is on a different host):
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=/data/example.yaml vllm serve /data/models/Llama-3.1-8B-Instruct --port 7080 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}'
and the LMCache config file is:
chunk_size: 64
local_device: "cpu"
remote_url: "lm://192.168.12.6:12080"
remote_serde: "naive"
enable_blending: False
max_local_cpu_size: 80
# Whether retrieve() is pipelined or not
pipelined_backend: False
We have our own proxy, whose code is available at:
https://github.com/bd-iaas-us/vllm/blob/lmcache_connector_from072/vllm/distributed/kv_transfer/kv_proxy/proxy.py
Please let me know if anything else is needed.
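For reference, a quick way to check that the remote LMCache server behind remote_url is reachable before launching the two vLLM instances. A minimal sketch; the host and port are simply the ones from the example config above:

import socket

host, port = "192.168.12.6", 12080  # taken from remote_url in the config above

try:
    with socket.create_connection((host, port), timeout=3):
        print(f"LMCache server reachable at {host}:{port}")
except OSError as exc:
    print(f"Cannot reach LMCache server at {host}:{port}: {exc}")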
@chenqianfzh Thanks for your kind help, this is very useful for me. I have successfully run vLLM+LMCache PD disagg following your instructions, thanks so much!
BTW, will LMCache work with DeepSeek R1 after this PR is merged?
@chenqianfzh I have applied this patch to our test env and did a test of PD-disagg on DeepSeek-V2-Lite-Chat, but it failed.
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1759, in execute_model
get_kv_transfer_group().send_kv_caches_and_hidden_states(
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 61, in send_kv_caches_and_hidden_states
self.connector.send_kv_caches_and_hidden_states(
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py", line 99, in send_kv_caches_and_hidden_states
store_status = self.lmcache_should_store(model_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 353, in lmcache_should_store
assert isinstance(model_input.attn_metadata, FlashAttentionMetadata), \
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Only FlashAttention backend is supported for now.
When I removed these asserts, it almost passed, except for an error log:
loc("/usr/local/lib/python3.12/dist-packages/vllm/attention/ops/triton_decode_attention.py":311:16): error: operation scheduled before its operands
The default backend for DeepSeek is Triton MLA, not FlashAttention. We need to explicitly disable MLA to enable the FlashAttention backend by setting the env var VLLM_MLA_DISABLE=1.
Hope it helps.
Please disable MLA (i.e., use FlashAttn) for vLLM + DeepSeek R1 + LMCache PD disaggregation.
@chenqianfzh Thanks for your help. So LMCache still cannot work with DeepSeek with MLA? Is there a way to let LMCache support MLA?
Hi @maobaolong, I'm occupied with other stuff. My ETA for this is by the end of this week.
@chenqianfzh and I are working on the MLA memory layout in LMCache.
As the KV cache of MLA takes a different shape than FlashAttn, changes are necessary. I am actively working on it and hopefully will have a PR soon.
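For intuition on why the layouts differ, here is an illustrative comparison of the two cache shapes. This is a sketch only; the exact layout and dimensions depend on the vLLM version and model config, so treat the numbers as assumptions:

import torch

num_blocks, block_size = 128, 16

# FlashAttention-style paged KV cache: separate K and V entries per KV head.
num_kv_heads, head_size = 8, 128
flash_kv = torch.empty(2, num_blocks, block_size, num_kv_heads, head_size)

# MLA-style cache: one compressed latent per token (kv_lora_rank plus the
# rotary part), with no per-head K/V split.
kv_lora_rank, qk_rope_head_dim = 512, 64
mla_kv = torch.empty(num_blocks, block_size, kv_lora_rank + qk_rope_head_dim)

print(flash_kv.shape)  # torch.Size([2, 128, 16, 8, 128])
print(mla_kv.shape)    # torch.Size([128, 16, 576])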
@KuntaiDu Thanks for your comments. I just replied. Could you take another look? Thanks.
@chenqianfzh When I disable MLA, this error log still appears. Maybe we should log this ERROR only when MLA is enabled?
ERROR LMCache: Failed to retrieve the hidden states. [2025-03-05 14:58:11,990] -- /usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py:189
Please follow https://github.com/bytedance-iaas/splitwise-demos
Added a new commit to fix the unnecessary hidden-states store/retrieve.
@rainj-me The code looks good to me for now.
I just made lmcache compatible with chunked prefill with (1) vllm pr: vllm-project/vllm#14505 (2) lmcache pr: #392
Could you (1) refactor your code in vllm_adapter to make it compatible with chunked prefill and (2) fix the format checker and the unit tests?
Then, I can merge this PR.
Thanks!
Hi @YaoJiayi
Per your comments, I have merged the chunked prefill PR and fixed some bugs. Since the MLA PR relies on this, please help prioritize it.
Thanks.
Hi @chenqianfzh, thank you for sharing. However, I encountered an error when following your instructions. Could you please help me check where the problem lies? I would be extremely grateful.
My script is:
lmcache_server localhost 8300 &
CUDA_VISIBLE_DEVICES=1,2 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=lm_test.yaml vllm serve /workspace/Q_model/Qwen/merged_model_0403 --port 8100 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --max-model-len 10000 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer"}' &
CUDA_VISIBLE_DEVICES=3,4 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=lm_test.yaml vllm serve /workspace/Q_model/Qwen/merged_model_0403 --port 8200 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --max-model-len 10000 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' &
wait_for_server 8100
wait_for_server 8200
python3 diss_lmcache.py &
diss_lmcache.py is:

import asyncio
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

# Initialize the FastAPI app
app = FastAPI()

# Base URLs for the two vLLM processes (set to the root of the API)
PREFILL_BASE_URLS = ["http://localhost:8100/v1"]
DECODE_BASE_URLS = ["http://localhost:8200/v1"]

# Initialize variables to hold the persistent clients
app.state.prefill_clients = None
app.state.decode_clients = None

counter = 0

@app.on_event("startup")
async def startup_event():
    """Initialize persistent HTTPX clients for vLLM services on startup."""
    app.state.decode_clients = [
        httpx.AsyncClient(timeout=None, base_url=url) for url in DECODE_BASE_URLS
    ]
    app.state.prefill_clients = [
        httpx.AsyncClient(timeout=None, base_url=url) for url in PREFILL_BASE_URLS
    ]

@app.on_event("shutdown")
async def shutdown_event():
    """Close the persistent HTTPX clients on shutdown."""
    for prefill_client in app.state.prefill_clients:
        await prefill_client.aclose()
    for decode_client in app.state.decode_clients:
        await decode_client.aclose()

async def send_request_to_vllm(client: httpx.AsyncClient, req_data: dict):
    """Send a request to a vLLM process using a persistent client."""
    req_data = req_data.copy()
    # print(f"req_data: {req_data}")
    req_data['max_tokens'] = 1
    req_data['max_completion_tokens'] = 1
    response = await client.post("/chat/completions", json=req_data)  # Correct endpoint path
    response.raise_for_status()
    return response

async def stream_vllm_response(client: httpx.AsyncClient, req_data: dict):
    """Asynchronously stream the response from a vLLM process using a persistent client.

    Args:
        client (httpx.AsyncClient): The persistent HTTPX client.
        req_data (dict): The JSON payload to send.

    Yields:
        bytes: Chunks of the response data.
    """
    async with client.stream(
            "POST", "/chat/completions",
            json=req_data) as response:  # Correct endpoint path
        response.raise_for_status()
        async for chunk in response.aiter_bytes():
            yield chunk

@app.post("/v1/chat/completions")
async def proxy_request(request: Request):
    global counter
    """Proxy endpoint that forwards requests to two vLLM services.

    Args:
        request (Request): The incoming HTTP request.

    Returns:
        StreamingResponse: The streamed response from the second vLLM service.
    """
    counter += 1
    req_data = await request.json()
    try:
        prefill_client = app.state.prefill_clients[counter % len(app.state.prefill_clients)]
        # Send request to prefill worker, ignore the response
        await send_request_to_vllm(prefill_client, req_data)
        decode_client = app.state.decode_clients[counter % len(app.state.decode_clients)]

        # Stream response from decode worker
        async def generate_stream():
            async for chunk in stream_vllm_response(decode_client, req_data):
                yield chunk

        return StreamingResponse(generate_stream(),
                                 media_type="application/json")
    except Exception as e:
        print(f"Error streaming response from vLLM-2: {e}")
        raise

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8810)
The error message is:
INFO: 127.0.0.1:56540 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 112, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
File "/workspace/Q_model/diss_lmcache.py", line 91, in proxy_request
await send_request_to_vllm(prefill_client, req_data)
File "/workspace/Q_model/diss_lmcache.py", line 51, in send_request_to_vllm
response.raise_for_status()
File "/usr/local/lib/python3.12/dist-packages/httpx/_models.py", line 829, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://localhost:8100/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
The command executed is:
curl -X POST http://localhost:8810/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/workspace/Q_model/Qwen/merged_model_0403/",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
@maobaolong Could you please take a look at my mistakes? I set it up in accordance with the configuration mentioned above.
Error streaming response from vLLM-2: Server error '500 Internal Server Error' for url 'http://localhost:8100/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
INFO: 127.0.0.1:41052 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 04-17 03:20:14 engine.py:137] error('Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-032014.pkl): unpack requires a buffer of 32 bytes')
ERROR 04-17 03:20:14 engine.py:137] Traceback (most recent call last):
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 04-17 03:20:14 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1733, in execute_model
ERROR 04-17 03:20:14 engine.py:137] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 04-17 03:20:14 engine.py:137] self.connector.send_kv_caches_and_hidden_states(
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py", line 94, in send_kv_caches_and_hidden_states
ERROR 04-17 03:20:14 engine.py:137] self.lmcache_store_kv(
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 116, in inner
ERROR 04-17 03:20:14 engine.py:137] result = func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 500, in lmcache_store_kv
ERROR 04-17 03:20:14 engine.py:137] skip_leading_tokens = engine.lookup(current_tokens)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py", line 207, in lookup
ERROR 04-17 03:20:14 engine.py:137] if not self.storage_manager.contains(key, search_range):
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/storage_manager.py", line 327, in contains
ERROR 04-17 03:20:14 engine.py:137] if backend.contains(key):
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/remote_backend.py", line 57, in contains
ERROR 04-17 03:20:14 engine.py:137] return future.result()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
ERROR 04-17 03:20:14 engine.py:137] return self.__get_result()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
ERROR 04-17 03:20:14 engine.py:137] raise self._exception
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/connector/lm_connector.py", line 81, in exists
ERROR 04-17 03:20:14 engine.py:137] return (ServerMetaMessage.deserialize(response).code ==
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/protocol.py", line 170, in deserialize
ERROR 04-17 03:20:14 engine.py:137] struct.unpack("iiiiiiii", s)
ERROR 04-17 03:20:14 engine.py:137] struct.error: unpack requires a buffer of 32 bytes
ERROR 04-17 03:20:14 engine.py:137]
ERROR 04-17 03:20:14 engine.py:137] The above exception was the direct cause of the following exception:
ERROR 04-17 03:20:14 engine.py:137]
ERROR 04-17 03:20:14 engine.py:137] Traceback (most recent call last):
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 135, in start
ERROR 04-17 03:20:14 engine.py:137] self.run_engine_loop()
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 198, in run_engine_loop
ERROR 04-17 03:20:14 engine.py:137] request_outputs = self.engine_step()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 216, in engine_step
ERROR 04-17 03:20:14 engine.py:137] raise e
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 04-17 03:20:14 engine.py:137] return self.engine.step()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1369, in step
ERROR 04-17 03:20:14 engine.py:137] outputs = self.model_executor.execute_model(
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 136, in execute_model
ERROR 04-17 03:20:14 engine.py:137] output = self.collective_rpc("execute_model",
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 04-17 03:20:14 engine.py:137] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
ERROR 04-17 03:20:14 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 410, in execute_model
ERROR 04-17 03:20:14 engine.py:137] output = self.model_runner.execute_model(
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 03:20:14 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 04-17 03:20:14 engine.py:137] raise type(err)(
ERROR 04-17 03:20:14 engine.py:137] struct.error: Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-032014.pkl): unpack requires a buffer of 32 bytes
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 112, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in call
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 73, in app
response = await f(request)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/Q_model/diss_lmcache.py", line 91, in proxy_request
await send_request_to_vllm(prefill_client, req_data)
File "/workspace/Q_model/diss_lmcache.py", line 51, in send_request_to_vllm
response.raise_for_status()
File "/usr/local/lib/python3.12/dist-packages/httpx/_models.py", line 829, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'http://localhost:8100/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [25415]
[rank0]: Traceback (most recent call last):
[rank0]: File "
@chenqianfzh @maobaolong @rainj-me Could you please share your proxy.py file with me?
https://github.com/bd-iaas-us/vllm/blob/lmcache_connector_from072/vllm/distributed/kv_transfer/kv_proxy/proxy.py
Here you are
@maobaolong Thank you for sharing. I have checked this file and it is the same as what I am currently using. I used the official image and applied PD disaggregation inside it, but I encountered an error. Could you please help me figure out where the problem lies? Thank you very much.
"POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 04-17 04:03:25 engine.py:137] error('Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-040325.pkl): unpack requires a buffer of 32 bytes')
ERROR 04-17 04:03:25 engine.py:137] Traceback (most recent call last):
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 04-17 04:03:25 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1733, in execute_model
ERROR 04-17 04:03:25 engine.py:137] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 04-17 04:03:25 engine.py:137] self.connector.send_kv_caches_and_hidden_states(
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py", line 94, in send_kv_caches_and_hidden_states
ERROR 04-17 04:03:25 engine.py:137] self.lmcache_store_kv(
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 116, in inner
ERROR 04-17 04:03:25 engine.py:137] result = func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 500, in lmcache_store_kv
ERROR 04-17 04:03:25 engine.py:137] skip_leading_tokens = engine.lookup(current_tokens)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py", line 207, in lookup
ERROR 04-17 04:03:25 engine.py:137] if not self.storage_manager.contains(key, search_range):
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/storage_manager.py", line 327, in contains
ERROR 04-17 04:03:25 engine.py:137] if backend.contains(key):
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/remote_backend.py", line 57, in contains
ERROR 04-17 04:03:25 engine.py:137] return future.result()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
ERROR 04-17 04:03:25 engine.py:137] return self.__get_result()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
ERROR 04-17 04:03:25 engine.py:137] raise self._exception
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/connector/lm_connector.py", line 81, in exists
ERROR 04-17 04:03:25 engine.py:137] return (ServerMetaMessage.deserialize(response).code ==
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/protocol.py", line 170, in deserialize
ERROR 04-17 04:03:25 engine.py:137] struct.unpack("iiiiiiii", s)
ERROR 04-17 04:03:25 engine.py:137] struct.error: unpack requires a buffer of 32 bytes
ERROR 04-17 04:03:25 engine.py:137]
ERROR 04-17 04:03:25 engine.py:137] The above exception was the direct cause of the following exception:
ERROR 04-17 04:03:25 engine.py:137]
ERROR 04-17 04:03:25 engine.py:137] Traceback (most recent call last):
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 135, in start
ERROR 04-17 04:03:25 engine.py:137] self.run_engine_loop()
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 198, in run_engine_loop
ERROR 04-17 04:03:25 engine.py:137] request_outputs = self.engine_step()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 216, in engine_step
ERROR 04-17 04:03:25 engine.py:137] raise e
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 04-17 04:03:25 engine.py:137] return self.engine.step()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1369, in step
ERROR 04-17 04:03:25 engine.py:137] outputs = self.model_executor.execute_model(
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 136, in execute_model
ERROR 04-17 04:03:25 engine.py:137] output = self.collective_rpc("execute_model",
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 04-17 04:03:25 engine.py:137] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
ERROR 04-17 04:03:25 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 410, in execute_model
ERROR 04-17 04:03:25 engine.py:137] output = self.model_runner.execute_model(
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 04:03:25 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 04-17 04:03:25 engine.py:137] raise type(err)(
ERROR 04-17 04:03:25 engine.py:137] struct.error: Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-040325.pkl): unpack requires a buffer of 32 bytes
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [29149]
[rank0]: Traceback (most recent call last):
[rank0]: File "
@chenqianfzh @rainj-me What is the status of the PR?
@hickeyma Hey Martin, I think this PR is no longer needed since it will not be used with the latest vLLM anymore.
@chenqianfzh @rainj-me Please let us know if we can close this PR
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it!