store/retrieve hidden states in PD Disagg
This PR allows the vLLM LMCache connector to store/retrieve hidden_states in PD disaggregation, so the first iteration on the consumer side does not need to recompute them.
End-to-end verified with Llama as well as DeepSeek.
@KuntaiDu @YaoJiayi
This commit transfers the hidden states from prefill to decode via LMCache (with remote KV cache storage), as shown in the following diagram:
+----------------+          +----------------+
|                |          |                |
|    Prefill     |          |     Decode     |
|                |          |                |
+----------------+          +----------------+
|                |          |                |
|    LMCache     |          |    LMCache     |
|                |          |                |
+------+---------+          +--+-------------+
       |                       |
       | put                   | get
       | KV Cache              | KV Cache
       | Hidden States         | Hidden States
       v                       v
+------+-----------------------+--------------------+
|                                                    |
|                                                    |
|              Remote KV Cache Storage               |
|                                                    |
|                                                    |
+----------------------------------------------------+
Without this commit, the last token is recomputed on the decode node; that recompute is prioritized by the vLLM scheduler and blocks the decode forward passes.
The recompute log on the decode worker looks like:
DEBUG LMCache: Injected token number: 2133 [2025-02-25 23:16:07,504] -- /data/rain_dev/LMCache/lmcache/integration/vllm/vllm_adapter.py:740
DEBUG LMCache: Rebuilt the input! [2025-02-25 23:16:07,508] -- /data/rain_dev/LMCache/lmcache/integration/vllm/vllm_adapter.py:768
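To make the intent concrete, here is a minimal, self-contained sketch of the put/get flow in the diagram above (illustrative Python only; RemoteKVStore, prefill_node, and decode_node are hypothetical stand-ins, not the actual LMCache or vLLM APIs):

# Illustrative sketch only: RemoteKVStore, prefill_node and decode_node are
# hypothetical stand-ins, not the actual LMCache or vLLM APIs.
import torch

class RemoteKVStore:
    """Stands in for the remote KV cache storage in the diagram above."""

    def __init__(self):
        self._store = {}

    def put(self, key, kv_cache, hidden_states):
        # Producer (prefill) side: store the KV cache and the hidden states together.
        self._store[key] = (kv_cache, hidden_states)

    def get(self, key):
        # Consumer (decode) side: retrieve both in one round trip.
        return self._store.get(key, (None, None))

def prefill_node(store, key, num_tokens, hidden_dim=4096):
    kv_cache = torch.zeros(2, num_tokens, 8, 64)         # illustrative shape only
    hidden_states = torch.zeros(num_tokens, hidden_dim)  # includes the last prefill token
    store.put(key, kv_cache, hidden_states)

def decode_node(store, key):
    kv_cache, hidden_states = store.get(key)
    if hidden_states is not None:
        # With this PR: the decode node can sample the first output token from the
        # stored hidden states; no last-token prefill is needed on the decode side.
        return hidden_states[-1]
    # Without this PR: the last token has to be recomputed here, and that prefill is
    # prioritized by the vLLM scheduler, blocking ongoing decode forward passes.
    raise RuntimeError("hidden states missing; last-token recompute required")

store = RemoteKVStore()
prefill_node(store, key="req-0", num_tokens=16)
print(decode_node(store, key="req-0").shape)  # torch.Size([4096])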
@chenqianfzh Thanks for your effort! Is the DeepSeek you mentioned R1? Could you provide an image (maybe pushed to Docker Hub) that can run vLLM and DeepSeek-V3? I would like to test it in our environment.
@chenqianfzh @rainj-me Just curious, how much overhead would it introduce if we do not save the KV cache but let the decoding instance decode 1 token?
The problem is that the last-token prefill will be scheduled by vLLM with higher priority and will block the decode forward sequence. This increases both TTFT and TPOT latency.
@YaoJiayi Thanks for your comments. Could you take another look, since I have updated the PR? Thanks.
Applications that use disaggregated prefill are typically latency-sensitive. Decoding one token basically introduces one inter-token latency, which is 10-20 ms. So I agree with @rainj-me that LMCache should handle hidden states.
@chenqianfzh Could you please give me a PD Disagg example that can run vLLM+LMCache?
With the PyNcclConnector, I have to specify arguments like the following:
# First node
python3 -m vllm.entrypoints.openai.api_server --dtype=half --model /root/.cache/huggingface \
--trust-remote-code --served-model-name Qwen2.5-1.5B-Instruct \
--port 8100 --max-model-len 10000 --gpu-memory-utilization 0.6 \
--kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9,"kv_ip":"192.168.1.243"}'
# Second node
python3 -m vllm.entrypoints.openai.api_server --dtype=half --model /root/.cache/huggingface \
--trust-remote-code --served-model-name Qwen2.5-1.5B-Instruct \
--port 8100 --max-model-len 10000 --gpu-memory-utilization 0.6 \
--kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9,"kv_ip":"192.168.1.243"}'
# Start a proxy server to route requests to P and D
https://github.com/vllm-project/vllm/blob/bc6ccb987877000ec271e0076317b03a66cde4bc/benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py
# Send request via http
curl http://localhost:8100/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "How to sleep fast?"}]
}'
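For anyone scripting the same request, the curl above can also be issued from Python. A minimal sketch using the requests library, with the endpoint and model name taken from the curl command:

import requests

resp = requests.post(
    "http://localhost:8100/v1/chat/completions",
    json={
        "model": "Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "How to sleep fast?"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])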
@rainj-me is preparing a blog about the details, and he said he has talked to you offline about it.
However, I can share the setup I used in my tests:
On the producer side:
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=/data/example.yaml vllm serve /data/models/Llama-3.1-8B-Instruct --port 7080 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer"}'
On the consumer side (it is on a different host):
LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=/data/example.yaml vllm serve /data/models/Llama-3.1-8B-Instruct --port 7080 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}'
and the LMCache config file is:
chunk_size: 64
local_device: "cpu"
remote_url: "lm://192.168.12.6:12080"
remote_serde: "naive"
enable_blending: False
max_local_cpu_size: 80
# Whether retrieve() is pipelined or not
pipelined_backend: False
We have our own proxy, whose code is available at:
https://github.com/bd-iaas-us/vllm/blob/lmcache_connector_from072/vllm/distributed/kv_transfer/kv_proxy/proxy.py
Please let me know if anything else is needed.
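For reference, a quick way to check that the remote LMCache server behind remote_url is reachable before launching the two vLLM instances. A minimal sketch; the host and port are simply the ones from the example config above:

import socket

host, port = "192.168.12.6", 12080  # taken from remote_url in the config above

try:
    with socket.create_connection((host, port), timeout=3):
        print(f"LMCache server reachable at {host}:{port}")
except OSError as exc:
    print(f"Cannot reach LMCache server at {host}:{port}: {exc}")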
@chenqianfzh Thanks for your kind help, this is very useful for me. I have successfully run vLLM+LMCache PD disagg following your instructions, thanks so much!
BTW, will LMCache work with DeepSeek R1 after this PR is merged?
@chenqianfzh I have applied this patch to our test env and did a test of PD-disagg on DeepSeek-V2-Lite-Chat, but it failed.
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1759, in execute_model
get_kv_transfer_group().send_kv_caches_and_hidden_states(
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 61, in send_kv_caches_and_hidden_states
self.connector.send_kv_caches_and_hidden_states(
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py", line 99, in send_kv_caches_and_hidden_states
store_status = self.lmcache_should_store(model_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 353, in lmcache_should_store
assert isinstance(model_input.attn_metadata, FlashAttentionMetadata), \
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Only FlashAttention backend is supported for now.
When I removed these asserts, it almost passed, except for an error log:
loc("/usr/local/lib/python3.12/dist-packages/vllm/attention/ops/triton_decode_attention.py":311:16): error: operation scheduled before its operands
The default backend for DeepSeek is Triton MLA, not FlashAttention. We need to explicitly disable MLA to enable the FlashAttention backend by setting the env var VLLM_MLA_DISABLE=1.
Hope it helps.
Please disable MLA (i.e., use FlashAttn) for vLLM + DeepSeek R1 + LMCache PD disaggregation.
@chenqianfzh Thanks for your help. So LMCache still cannot work with DeepSeek with MLA? Is there a way to let LMCache support MLA?
Hi @maobaolong, I'm occupied with other stuff. My ETA for this is by the end of this week.
@chenqianfzh and I are working on the MLA memory layout in LMCache.
As the KV cache of MLA takes a different shape than FlashAttn, changes are necessary. I am actively working on it and hopefully will have a PR soon.
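For intuition on why the layouts differ, here is an illustrative comparison of the two cache shapes. This is a sketch only; the exact layout and dimensions depend on the vLLM version and model config, so treat the numbers as assumptions:

import torch

num_blocks, block_size = 128, 16

# FlashAttention-style paged KV cache: separate K and V entries per KV head.
num_kv_heads, head_size = 8, 128
flash_kv = torch.empty(2, num_blocks, block_size, num_kv_heads, head_size)

# MLA-style cache: one compressed latent per token (kv_lora_rank plus the
# rotary part), with no per-head K/V split.
kv_lora_rank, qk_rope_head_dim = 512, 64
mla_kv = torch.empty(num_blocks, block_size, kv_lora_rank + qk_rope_head_dim)

print(flash_kv.shape)  # torch.Size([2, 128, 16, 8, 128])
print(mla_kv.shape)    # torch.Size([128, 16, 576])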
@KuntaiDu Thanks for your comments. I just replied. Could you take another look? Thanks.
@chenqianfzh When I disable MLA, this error log still appears. Maybe we should log this ERROR only when MLA is enabled?
ERROR LMCache: Failed to retrieve the hidden states. [2025-03-05 14:58:11,990] -- /usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py:189
Please follow https://github.com/bytedance-iaas/splitwise-demos
Added a new commit to fix the unnecessary hidden-states store/retrieve.
@rainj-me The code looks good to me for now.
I just made lmcache compatible with chunked prefill with (1) vllm pr: vllm-project/vllm#14505 (2) lmcache pr: #392
Could you (1) refactor your code in vllm_adapter to make it compatible with chunked prefill and (2) fix the format checker and the unit tests?
Then, I can merge this PR.
Thanks!
Hi @YaoJiayi
Per your comments, I have merged the chunked prefill PR and fixed some bugs. Since the MLA PR relies on this, please help prioritize it.
Thanks.
Hi @chenqianfzh, thank you for sharing. However, I encountered an error when following your instructions. Could you please help me check where the problem lies? I would be extremely grateful.
My script is:
lmcache_server localhost 8300 &
CUDA_VISIBLE_DEVICES=1,2 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=lm_test.yaml vllm serve /workspace/Q_model/Qwen/merged_model_0403 --port 8100 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --max-model-len 10000 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer"}' &
CUDA_VISIBLE_DEVICES=3,4 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=lm_test.yaml vllm serve /workspace/Q_model/Qwen/merged_model_0403 --port 8200 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --max-model-len 10000 --distributed-executor-backend mp --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' &
wait_for_server 8100
wait_for_server 8200
python3 diss_lmcache.py &
diss_lmcache.py is:

import asyncio
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

# Initialize the FastAPI app
app = FastAPI()

# Base URLs for the two vLLM processes (set to the root of the API)
PREFILL_BASE_URLS = ["http://localhost:8100/v1"]
DECODE_BASE_URLS = ["http://localhost:8200/v1"]

# Initialize variables to hold the persistent clients
app.state.prefill_clients = None
app.state.decode_clients = None

counter = 0

@app.on_event("startup")
async def startup_event():
    """Initialize persistent HTTPX clients for vLLM services on startup."""
    app.state.decode_clients = [
        httpx.AsyncClient(timeout=None, base_url=url) for url in DECODE_BASE_URLS
    ]
    app.state.prefill_clients = [
        httpx.AsyncClient(timeout=None, base_url=url) for url in PREFILL_BASE_URLS
    ]

@app.on_event("shutdown")
async def shutdown_event():
    """Close the persistent HTTPX clients on shutdown."""
    for prefill_client in app.state.prefill_clients:
        await prefill_client.aclose()
    for decode_client in app.state.decode_clients:
        await decode_client.aclose()

async def send_request_to_vllm(client: httpx.AsyncClient, req_data: dict):
    """Send a request to a vLLM process using a persistent client."""
    req_data = req_data.copy()
    # print(f"req_data: {req_data}")
    req_data['max_tokens'] = 1
    req_data['max_completion_tokens'] = 1
    response = await client.post("/chat/completions", json=req_data)  # Correct endpoint path
    response.raise_for_status()
    return response

async def stream_vllm_response(client: httpx.AsyncClient, req_data: dict):
    """Asynchronously stream the response from a vLLM process using a persistent client.

    Args:
        client (httpx.AsyncClient): The persistent HTTPX client.
        req_data (dict): The JSON payload to send.

    Yields:
        bytes: Chunks of the response data.
    """
    async with client.stream(
            "POST", "/chat/completions",
            json=req_data) as response:  # Correct endpoint path
        response.raise_for_status()
        async for chunk in response.aiter_bytes():
            yield chunk

@app.post("/v1/chat/completions")
async def proxy_request(request: Request):
    global counter
    """Proxy endpoint that forwards requests to two vLLM services.

    Args:
        request (Request): The incoming HTTP request.

    Returns:
        StreamingResponse: The streamed response from the second vLLM service.
    """
    counter += 1
    req_data = await request.json()
    try:
        prefill_client = app.state.prefill_clients[counter % len(app.state.prefill_clients)]
        # Send request to prefill worker, ignore the response
        await send_request_to_vllm(prefill_client, req_data)
        decode_client = app.state.decode_clients[counter % len(app.state.decode_clients)]

        # Stream response from decode worker
        async def generate_stream():
            async for chunk in stream_vllm_response(decode_client, req_data):
                yield chunk

        return StreamingResponse(generate_stream(),
                                 media_type="application/json")
    except Exception as e:
        print(f"Error streaming response from vLLM-2: {e}")
        raise

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8810)
The error message is:
INFO: 127.0.0.1:56540 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 112, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
File "/workspace/Q_model/diss_lmcache.py", line 91, in proxy_request
await send_request_to_vllm(prefill_client, req_data)
File "/workspace/Q_model/diss_lmcache.py", line 51, in send_request_to_vllm
response.raise_for_status()
File "/usr/local/lib/python3.12/dist-packages/httpx/_models.py", line 829, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://localhost:8100/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
The command executed is:
curl -X POST http://localhost:8810/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/workspace/Q_model/Qwen/merged_model_0403/",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
@maobaolong Could you please take a look at my mistakes? I set it up in accordance with the configuration mentioned above.
Error streaming response from vLLM-2: Server error '500 Internal Server Error' for url 'http://localhost:8100/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
INFO: 127.0.0.1:41052 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 04-17 03:20:14 engine.py:137] error('Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-032014.pkl): unpack requires a buffer of 32 bytes')
ERROR 04-17 03:20:14 engine.py:137] Traceback (most recent call last):
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 04-17 03:20:14 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1733, in execute_model
ERROR 04-17 03:20:14 engine.py:137] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 04-17 03:20:14 engine.py:137] self.connector.send_kv_caches_and_hidden_states(
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py", line 94, in send_kv_caches_and_hidden_states
ERROR 04-17 03:20:14 engine.py:137] self.lmcache_store_kv(
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 116, in inner
ERROR 04-17 03:20:14 engine.py:137] result = func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 500, in lmcache_store_kv
ERROR 04-17 03:20:14 engine.py:137] skip_leading_tokens = engine.lookup(current_tokens)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py", line 207, in lookup
ERROR 04-17 03:20:14 engine.py:137] if not self.storage_manager.contains(key, search_range):
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/storage_manager.py", line 327, in contains
ERROR 04-17 03:20:14 engine.py:137] if backend.contains(key):
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/remote_backend.py", line 57, in contains
ERROR 04-17 03:20:14 engine.py:137] return future.result()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
ERROR 04-17 03:20:14 engine.py:137] return self.__get_result()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
ERROR 04-17 03:20:14 engine.py:137] raise self._exception
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/connector/lm_connector.py", line 81, in exists
ERROR 04-17 03:20:14 engine.py:137] return (ServerMetaMessage.deserialize(response).code ==
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/protocol.py", line 170, in deserialize
ERROR 04-17 03:20:14 engine.py:137] struct.unpack("iiiiiiii", s)
ERROR 04-17 03:20:14 engine.py:137] struct.error: unpack requires a buffer of 32 bytes
ERROR 04-17 03:20:14 engine.py:137]
ERROR 04-17 03:20:14 engine.py:137] The above exception was the direct cause of the following exception:
ERROR 04-17 03:20:14 engine.py:137]
ERROR 04-17 03:20:14 engine.py:137] Traceback (most recent call last):
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 135, in start
ERROR 04-17 03:20:14 engine.py:137] self.run_engine_loop()
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 198, in run_engine_loop
ERROR 04-17 03:20:14 engine.py:137] request_outputs = self.engine_step()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 216, in engine_step
ERROR 04-17 03:20:14 engine.py:137] raise e
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 04-17 03:20:14 engine.py:137] return self.engine.step()
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1369, in step
ERROR 04-17 03:20:14 engine.py:137] outputs = self.model_executor.execute_model(
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 136, in execute_model
ERROR 04-17 03:20:14 engine.py:137] output = self.collective_rpc("execute_model",
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 04-17 03:20:14 engine.py:137] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
ERROR 04-17 03:20:14 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 410, in execute_model
ERROR 04-17 03:20:14 engine.py:137] output = self.model_runner.execute_model(
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 03:20:14 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 03:20:14 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 03:20:14 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 04-17 03:20:14 engine.py:137] raise type(err)(
ERROR 04-17 03:20:14 engine.py:137] struct.error: Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-032014.pkl): unpack requires a buffer of 32 bytes
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 112, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in call
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 73, in app
response = await f(request)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/Q_model/diss_lmcache.py", line 91, in proxy_request
await send_request_to_vllm(prefill_client, req_data)
File "/workspace/Q_model/diss_lmcache.py", line 51, in send_request_to_vllm
response.raise_for_status()
File "/usr/local/lib/python3.12/dist-packages/httpx/_models.py", line 829, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'http://localhost:8100/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [25415]
[rank0]: Traceback (most recent call last):
[rank0]: File "
@chenqianfzh @maobaolong @rainj-me Could you please share your proxy.py file with me?
https://github.com/bd-iaas-us/vllm/blob/lmcache_connector_from072/vllm/distributed/kv_transfer/kv_proxy/proxy.py
Here you are
@maobaolong Thank you for sharing. I have checked this file and it is the same as what I am currently using. I used the official image and applied PD disaggregation inside it, but I encountered an error. Could you please help me figure out where the problem lies? Thank you very much.
"POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 04-17 04:03:25 engine.py:137] error('Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-040325.pkl): unpack requires a buffer of 32 bytes')
ERROR 04-17 04:03:25 engine.py:137] Traceback (most recent call last):
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 04-17 04:03:25 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1733, in execute_model
ERROR 04-17 04:03:25 engine.py:137] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 04-17 04:03:25 engine.py:137] self.connector.send_kv_caches_and_hidden_states(
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py", line 94, in send_kv_caches_and_hidden_states
ERROR 04-17 04:03:25 engine.py:137] self.lmcache_store_kv(
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 116, in inner
ERROR 04-17 04:03:25 engine.py:137] result = func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py", line 500, in lmcache_store_kv
ERROR 04-17 04:03:25 engine.py:137] skip_leading_tokens = engine.lookup(current_tokens)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py", line 207, in lookup
ERROR 04-17 04:03:25 engine.py:137] if not self.storage_manager.contains(key, search_range):
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/storage_manager.py", line 327, in contains
ERROR 04-17 04:03:25 engine.py:137] if backend.contains(key):
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/remote_backend.py", line 57, in contains
ERROR 04-17 04:03:25 engine.py:137] return future.result()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
ERROR 04-17 04:03:25 engine.py:137] return self.__get_result()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
ERROR 04-17 04:03:25 engine.py:137] raise self._exception
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/storage_backend/connector/lm_connector.py", line 81, in exists
ERROR 04-17 04:03:25 engine.py:137] return (ServerMetaMessage.deserialize(response).code ==
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/lmcache/experimental/protocol.py", line 170, in deserialize
ERROR 04-17 04:03:25 engine.py:137] struct.unpack("iiiiiiii", s)
ERROR 04-17 04:03:25 engine.py:137] struct.error: unpack requires a buffer of 32 bytes
ERROR 04-17 04:03:25 engine.py:137]
ERROR 04-17 04:03:25 engine.py:137] The above exception was the direct cause of the following exception:
ERROR 04-17 04:03:25 engine.py:137]
ERROR 04-17 04:03:25 engine.py:137] Traceback (most recent call last):
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 135, in start
ERROR 04-17 04:03:25 engine.py:137] self.run_engine_loop()
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 198, in run_engine_loop
ERROR 04-17 04:03:25 engine.py:137] request_outputs = self.engine_step()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 216, in engine_step
ERROR 04-17 04:03:25 engine.py:137] raise e
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 207, in engine_step
ERROR 04-17 04:03:25 engine.py:137] return self.engine.step()
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1369, in step
ERROR 04-17 04:03:25 engine.py:137] outputs = self.model_executor.execute_model(
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 136, in execute_model
ERROR 04-17 04:03:25 engine.py:137] output = self.collective_rpc("execute_model",
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
ERROR 04-17 04:03:25 engine.py:137] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
ERROR 04-17 04:03:25 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 410, in execute_model
ERROR 04-17 04:03:25 engine.py:137] output = self.model_runner.execute_model(
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 04:03:25 engine.py:137] return func(*args, **kwargs)
ERROR 04-17 04:03:25 engine.py:137] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 04:03:25 engine.py:137] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 04-17 04:03:25 engine.py:137] raise type(err)(
ERROR 04-17 04:03:25 engine.py:137] struct.error: Error in model execution (input dumped to /tmp/err_execute_model_input_20250417-040325.pkl): unpack requires a buffer of 32 bytes
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [29149]
[rank0]: Traceback (most recent call last):
[rank0]: File "
@chenqianfzh @rainj-me What is the status of the PR?
@hickeyma Hey Martin, I think this PR is no longer needed since it will not be used with the latest vLLM anymore.
@chenqianfzh @rainj-me Please let us know if we can close this PR
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it!