[Bug] Failed to run inference on DeepSeek-V3.2-Exp
I use vLLM v0.11.0 with LMCache 0.3.9.post2 to deploy DeepSeek-V3.2-Exp on a Ray cluster with 2 nodes (4x GB200 each).
My config:
lmcache_config.yaml:

```yaml
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5.0
remote_url: "redis://10.62.207.53:32628"
remote_serde: "naive"
```
vllm_server.sh:

```bash
export RAY_CGRAPH_get_timeout=3000
export NCCL_DEBUG=DEBUG
export NCCL_DEBUG_SUBSYS=INFO
python3 -m vllm.entrypoints.openai.api_server \
--model=/mnt/allen/models/deepseek-ai/DeepSeek-V3.2-Exp \
--served-model-name=deepseek-ai/DeepSeek-V3.2-Exp \
--tensor-parallel-size=4 \
--pipeline-parallel-size=2 \
--distributed-executor-backend=ray \
--enable-expert-parallel \
--trust-remote-code \
--gpu-memory-utilization=0.9 \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--chat-template /mnt/allen/scripts/tool_template/tool_chat_template_deepseekv31.jinja \
--reasoning-parser=deepseek_r1 \
--kv-transfer-config='{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
--enable-log-requests
```
Deployment succeeded, but I got the following error during inference:

```text
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] engine_core.run_busy_loop()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self._process_engine_step()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 346, in step_with_batch_queue
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 270, in execute_model_with_error_logging
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] raise err
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 261, in execute_model_with_error_logging
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] return model_fn(scheduler_output)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 347, in <lambda>
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] lambda _: future.result(), scheduler_output)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_distributed_executor.py", line 40, in result
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] outputs = [ref.get() for ref in self.refs]
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 150, in get
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] return _process_return_vals(return_vals, True)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 27, in _process_return_vals
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] raise val.as_instanceof_cause()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.__ray_call__() (pid=112445, ip=172.31.235.177)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_utils.py", line 136, in execute_model_ray
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] output = self.worker.model_runner.execute_model(
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2296, in execute_model
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self.maybe_get_kv_connector_output(scheduler_output) as
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] next(self.gen)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 119, in _get_kv_connector_output
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] kv_connector.wait_for_save()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py", line 89, in wait_for_save
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self._lmcache_engine.wait_for_save()
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/integration/vllm/vllm_v1_adapter.py", line 1232, in wait_for_save
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self.lmcache_engine.store(
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] return func(*args, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/cache_engine.py", line 292, in store
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self.gpu_connector.batched_from_gpu(memory_objs, starts, ends, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/gpu_connector.py", line 321, in batched_from_gpu
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self.from_gpu(memory_obj, start, end, **kwargs)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/gpu_connector.py", line 274, in from_gpu
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] kv_cache_pointers = self._initialize_pointers(self.kvcaches)
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] File "/usr/local/lib/python3.12/dist-packages/lmcache-0.3.9.post2-py3.12-linux-aarch64.egg/lmcache/v1/gpu_connector.py", line 171, in _initialize_pointers
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] self.kv_cache_pointers.numpy()[:] = [t.data_ptr() for t in kv_caches]
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(EngineCore_DP0 pid=112250) ERROR 11-12 06:47:03 [core.py:710] ValueError: could not broadcast input array from shape (62,) into shape (31,)
```
Is there anything wrong with my configuration? Please advise. Thanks~
👀 In theory, are we supposed to support DSA (DeepSeek Sparse Attention) already?
@panpan0000 Yeah, it is necessary!
@panpan0000 @niceallen We have reproduced this issue.
@maobaolong Thank you~
It seems to be related to sparse attention, which adds an indexer cache per layer and thereby doubles the number of entries in the KV cache pointer array (kv_cache_pointers): hence the broadcast failure of 62 pointers into a buffer sized for 31.
https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html
Yeah, and beyond that, the dtype and shape of those indexer caches differ from the original normal attention layers.
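To make the mismatch concrete, here is a minimal standalone sketch of the failure mode (simplified from `_initialize_pointers` in lmcache/v1/gpu_connector.py; the layer count of 31 is taken from the error message and presumably corresponds to the attention layers on one pipeline stage):

```python
import numpy as np
import torch

num_layers = 31  # layers on this pipeline stage, per the error message

# LMCache pre-allocates one pointer slot per layer's KV cache tensor:
kv_cache_pointers = np.empty(num_layers, dtype=np.int64)

# With DSA, vLLM registers an extra indexer cache per layer, so the
# engine hands back twice as many cache tensors as there are layers:
kv_caches = [torch.empty(1) for _ in range(2 * num_layers)]

# Reproduces the crash:
# ValueError: could not broadcast input array from shape (62,) into shape (31,)
kv_cache_pointers[:] = [t.data_ptr() for t in kv_caches]
```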
These PRs are trying to fix this: #2215 #2219 #2220 #2230
@niceallen @panpan0000 The above PRs are all merged, but for now we only support LocalCpuBackend for DeepSeek-V3.2. Would you like to give it a try?
BTW, RemoteBackend support for DeepSeek-V3.2 is on the way.
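For reference, trying LocalCpuBackend with the config from this report would mean dropping the remote_url and remote_serde entries (a sketch, assuming no other fields need to change):

```yaml
# lmcache_config.yaml with the Redis remote removed, since
# RemoteBackend does not support DeepSeek-V3.2 yet
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5.0
```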
@maobaolong Thanks for the solution! Yesterday I attempted to build LMCache in an ARM64 (aarch64) environment using the vLLM nightly image, but it failed. I'd also like to ask whether there is an estimated timeline for RemoteBackend support, as I'm planning to use Redis.
@niceallen We will submit a PR today; feel free to cherry-pick it and give it a trial.