ValueError: Cannot get 31 free blocks from the pool
I encountered this error while my engines were running inference:
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] engine_core.run_busy_loop()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] self._process_engine_step()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 283, in step
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] scheduler_output = self.scheduler.schedule()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py", line 471, in schedule
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] new_blocks = self.kv_cache_manager.allocate_slots(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 288, in allocate_slots
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] new_blocks = self.coordinator.allocate_new_blocks(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_coordinator.py", line 112, in allocate_new_blocks
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] return tuple(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_coordinator.py", line 113, in <genexpr>
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] manager.allocate_new_blocks(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/single_type_kv_cache_manager.py", line 129, in allocate_new_blocks
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] new_blocks = self.block_pool.get_new_blocks(num_new_blocks)
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/tmp/kvcached/kvcached/integration/vllm/patches.py", line 87, in get_new_blocks
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] raise ValueError(f"Cannot get {num_blocks} free blocks from the pool")
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ValueError: Cannot get 31 free blocks from the pool
Confirmed that I did not set gpu_memory_utilization. I am using qwen7B on an A100 80GB and hosting 3 engines on 1 GPU. Sometimes I also see an issue similar to https://github.com/ovg-project/kvcached/issues/191, and I think it may share the same root cause. @jiarong0907 @ivanium
I tried reducing max_num_batched_tokens and it worked. The default value was 8192, and I reduced it to 1024, which equals max_model_len. There is a reduction in speed, though.
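For reference, here is a minimal sketch of that workaround (the model name, prompt, and exact values are placeholders, not my production setup):

```python
from vllm import LLM, SamplingParams

# Cap the scheduler's per-step token budget so each step asks the
# block pool for fewer new KV-cache blocks at once.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # placeholder; the report above only says "qwen7B"
    max_model_len=1024,
    max_num_batched_tokens=1024,      # reduced from the default of 8192
    # gpu_memory_utilization is left at its default, as in the report
)

# Yes/no classification: a single completion token per request.
params = SamplingParams(max_tokens=1, temperature=0.0)
outputs = llm.generate(["Answer yes or no: is the word 'dog' related to the image?"], params)
print(outputs[0].outputs[0].text)
```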
Thanks for the details and the log! I agree this is related to #191. I think it is likely a race condition: LLM engine A checks available GPU memory first but allocates its KV cache blocks later, and some other engine B may allocate memory during that window, so A cannot get the memory it planned for. I'll need some time to investigate this, but I'll keep you updated.
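To make the suspected interleaving concrete, here is a toy sketch of the check-then-allocate race (hypothetical names and numbers, not kvcached's actual code):

```python
import threading
import time

# One shared GPU block pool, grossly simplified.
free_blocks = 100
lock = threading.Lock()

def engine(name: str, planned: int, delay: float) -> None:
    """Check free memory, then allocate later -- the gap is the race window."""
    global free_blocks
    observed = free_blocks          # step 1: engine sizes its KV cache from this reading
    time.sleep(delay)               # window between "check" and "allocate"
    with lock:                      # step 2: allocate based on the stale observation
        if free_blocks < planned:
            print(f"{name}: Cannot get {planned} free blocks from the pool "
                  f"(saw {observed} at check time, only {free_blocks} left now)")
            return
        free_blocks -= planned
        print(f"{name}: allocated {planned} blocks, {free_blocks} remaining")

# Engine A plans to take 80 blocks but is slow to allocate; engine B grabs 60 first.
a = threading.Thread(target=engine, args=("engine A", 80, 0.1))
b = threading.Thread(target=engine, args=("engine B", 60, 0.0))
a.start(); b.start(); a.join(); b.join()
```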
Hi @alecngo, thanks for the issue! Could you provide more configuration details from when the bug occurs, besides max_num_batched_tokens = 8k? For example, the dataset length (is it still 200k as in issue #191?), the prompt and completion lengths, and so on. That would be very helpful for diagnosing the bug.
The dataset length is still 200k. The data is under NDA, so I'll just summarize: the prompt is around 300 English words asking Qwen to answer yes or no on whether a word is related to the image, so the completion length should be just 1 token.
Interesting. Thanks for the info. I will try to reproduce it first.
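For the repro, I am planning something along these lines (ports, model name, and prompt wording are all assumptions on my side):

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Three vLLM OpenAI-compatible servers sharing one A100 via kvcached (assumed ports).
ENGINES = ["http://localhost:8001/v1", "http://localhost:8002/v1", "http://localhost:8003/v1"]
clients = [OpenAI(base_url=url, api_key="EMPTY") for url in ENGINES]

# ~300-word yes/no prompt as a stand-in for the NDA'd one.
prompt = ("Answer yes or no: is the following word related to the image described here? "
          + "lorem ipsum " * 150)

def ask(i: int) -> str:
    client = clients[i % len(clients)]
    resp = client.completions.create(
        model="qwen7b",     # whatever model name each server was launched with
        prompt=prompt,
        max_tokens=1,       # completions are a single yes/no token
    )
    return resp.choices[0].text

# Push 200k requests across all three engines concurrently to stress the shared pool.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(ask, range(200_000)))
```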