
[Bug] Failed to run inference on DeepSeek-V3.2-Exp

Open · niceallen opened this issue 2 months ago

I am using vLLM v0.11.0 with LMCache 0.3.9.post2 to deploy DeepSeek-V3.2-Exp on a Ray cluster with 2 nodes (4×GB200 each).

My config:

lmcache_config.yaml

chunk_size: 256
local_cpu: true
max_local_cpu_size: 5.0
remote_url: "redis://10.62.207.53:32628"
remote_serde: "naive"

vllm_server.sh

export RAY_CGRAPH_get_timeout=3000
export NCCL_DEBUG=DEBUG
export NCCL_DEBUG_SUBSYS=INFO
python3 -m vllm.entrypoints.openai.api_server \
        --model=/mnt/allen/models/deepseek-ai/DeepSeek-V3.2-Exp \
        --served-model-name=deepseek-ai/DeepSeek-V3.2-Exp \
        --tensor-parallel-size=4 \
        --pipeline-parallel-size=2 \
        --distributed-executor-backend=ray \
        --enable-expert-parallel \
        --trust-remote-code \
        --gpu-memory-utilization=0.9 \
        --enable-prefix-caching \
        --enable-chunked-prefill \
        --enable-auto-tool-choice \
        --tool-call-parser deepseek_v31 \
        --chat-template /mnt/allen/scripts/tool_template/tool_chat_template_deepseekv31.jinja \
        --reasoning-parser=deepseek_r1 \
        --kv-transfer-config='{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
        --enable-log-requests

Deployment succeeded, but I got the following error during inference:

(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self._process_engine_step()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 346, in step_with_batch_queue
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 270, in execute_model_with_error_logging
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     raise err
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 261, in execute_model_with_error_logging
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     return model_fn(scheduler_output)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 347, in <lambda>
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     lambda _: future.result(), scheduler_output)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]               ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_distributed_executor.py", line 40, in result
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     outputs = [ref.get() for ref in self.refs]
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]                ^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 150, in get
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     return _process_return_vals(return_vals, True)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 27, in _process_return_vals
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     raise val.as_instanceof_cause()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.__ray_call__() (pid=112445, ip=172.31.235.177)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_utils.py", line 136, in execute_model_ray
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     output = self.worker.model_runner.execute_model(
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     return func(*args, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2296, in execute_model
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self.maybe_get_kv_connector_output(scheduler_output) as
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     next(self.gen)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 119, in _get_kv_connector_output
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     kv_connector.wait_for_save()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py", line 89, in wait_for_save
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self._lmcache_engine.wait_for_save()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/integration/vllm/vllm_v1_adapter.py", line 1232, in wait_for_save
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self.lmcache_engine.store(
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     return func(*args, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/cache_engine.py", line 292, in store
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self.gpu_connector.batched_from_gpu(memory_objs, starts, ends, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/gpu_connector.py", line 321, in batched_from_gpu
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self.from_gpu(memory_obj, start, end, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/gpu_connector.py", line 274, in from_gpu
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     kv_cache_pointers = self._initialize_pointers(self.kvcaches)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/gpu_connector.py", line 171, in _initialize_pointers
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     self.kv_cache_pointers.numpy()[:] = [t.data_ptr() for t in kv_caches]
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ValueError: could not broadcast input array from shape (62,) into shape (31,)

Is there anything wrong with my configuration? Please advise. Thanks~

niceallen avatar Nov 12 '25 14:11 niceallen

👀 In theory, are we supposed to support DSA already?

panpan0000 avatar Nov 27 '25 08:11 panpan0000

@panpan0000 Yeah, it is necessary!

maobaolong avatar Dec 08 '25 07:12 maobaolong

@panpan0000 @niceallen We have reproduced this issue.

maobaolong avatar Dec 09 '25 01:12 maobaolong

@maobaolong Thank you~ It seems to be related to sparse attention, which adds an indexer cache per layer, thereby doubling the size of the KV cache pointer array (kv_cache_pointers). https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html
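If that is the cause, the 62-vs-31 mismatch in the traceback is exactly this doubling. A minimal sketch (hypothetical helper names; the layer count 31 is taken from the shapes in the error) of what `_initialize_pointers` runs into:

```python
# LMCache pre-sizes its pointer buffer for one KV cache per attention
# layer, but with DeepSeek Sparse Attention (DSA) vLLM registers an extra
# indexer cache per layer, so the list it hands over is twice as long.
NUM_LAYERS = 31

pointer_slots = [0] * NUM_LAYERS                        # sized for KV caches only
kv_caches = ["kv"] * NUM_LAYERS + ["idx"] * NUM_LAYERS  # 62 tensors with DSA

def fill_pointers(slots, caches):
    # Mirrors numpy's `self.kv_cache_pointers.numpy()[:] = [t.data_ptr() ...]`,
    # which raises when the source and destination lengths differ.
    if len(caches) != len(slots):
        raise ValueError(
            f"could not broadcast input array from shape ({len(caches)},) "
            f"into shape ({len(slots)},)"
        )
    slots[:] = caches

try:
    fill_pointers(pointer_slots, kv_caches)
except ValueError as err:
    print(err)  # could not broadcast input array from shape (62,) into shape (31,)
```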

niceallen avatar Dec 09 '25 02:12 niceallen

Yeah. Besides that, the dtype and shape of those indexer caches differ from those of the original normal layers.
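To illustrate (with made-up dtypes and shapes — only the heterogeneity matters), any fix has to separate the indexer caches from the normal KV caches rather than treat the flat list uniformly:

```python
from collections import namedtuple

# Stand-in for a cache tensor. The dtypes/shapes below are hypothetical;
# the point is only that indexer caches differ in both dtype and shape
# from the normal KV caches of the same layer, so a single homogeneous
# pointer buffer cannot describe both.
Cache = namedtuple("Cache", ["layer", "kind", "dtype", "shape"])

caches = []
for layer in range(3):  # 3 layers for brevity
    caches.append(Cache(layer, "kv", "bfloat16", (1024, 64, 576)))
    caches.append(Cache(layer, "indexer", "float8", (1024, 64, 128)))

# Split by kind so each group can get its own correctly-sized buffer.
kv_only = [c for c in caches if c.kind == "kv"]
indexer_only = [c for c in caches if c.kind == "indexer"]
print(len(caches), len(kv_only), len(indexer_only))  # 6 3 3
```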

maobaolong avatar Dec 09 '25 06:12 maobaolong

These PRs aim to fix this: #2215 #2219 #2220 #2230

maobaolong avatar Dec 14 '25 06:12 maobaolong

@niceallen @panpan0000 The above PRs have all been merged, but for now only LocalCpuBackend is supported for DeepSeek-V3.2. Would you like to give it a try?

BTW, RemoteBackend support for DeepSeek-V3.2 is on the way.

maobaolong avatar Dec 17 '25 01:12 maobaolong

@maobaolong Thanks for the solution! Yesterday I attempted to build LMCache in an ARM64 (aarch64) environment using the vLLM nightly image, but it failed. I'd also like to ask whether there is an estimated timeline for RemoteBackend support, as I'm planning to use Redis.

niceallen avatar Dec 17 '25 02:12 niceallen

@niceallen We will submit a PR today; feel free to cherry-pick it and give it a try.

maobaolong avatar Dec 17 '25 02:12 maobaolong