ValueError: Cannot get 31 free blocks from the pool
I encountered this error while my engines were running inference:
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] Traceback (most recent call last):
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] engine_core.run_busy_loop()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] self._process_engine_step()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 754, in _process_engine_step
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 283, in step
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] scheduler_output = self.scheduler.schedule()
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/sched/scheduler.py", line 471, in schedule
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] new_blocks = self.kv_cache_manager.allocate_slots(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 288, in allocate_slots
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] new_blocks = self.coordinator.allocate_new_blocks(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_coordinator.py", line 112, in allocate_new_blocks
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] return tuple(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_coordinator.py", line 113, in <genexpr>
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] manager.allocate_new_blocks(
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/core/single_type_kv_cache_manager.py", line 129, in allocate_new_blocks
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] new_blocks = self.block_pool.get_new_blocks(num_new_blocks)
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] File "/tmp/kvcached/kvcached/integration/vllm/patches.py", line 87, in get_new_blocks
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] raise ValueError(f"Cannot get {num_blocks} free blocks from the pool")
(EngineCore_DP0 pid=10271) ERROR 10-27 14:16:34 [core.py:710] ValueError: Cannot get 31 free blocks from the pool
Confirmed that I did not set gpu_memory_utilization. I am using qwen7B on an A100 80GB and hosting 3 engines on 1 GPU. Sometimes I also see an issue similar to https://github.com/ovg-project/kvcached/issues/191, and I think it may share the same root cause. @jiarong0907 @ivanium
I tried reducing max_num_batched_tokens and it worked. The default value was 8192, and I reduced it to 1024, which equals max_model_len. There is a reduction in speed, though.
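For reference, here is a minimal sketch of that workaround (the model name, prompt, and exact values are placeholders, not my production setup):

```python
from vllm import LLM, SamplingParams

# Cap the scheduler's per-step token budget so each step asks the
# block pool for fewer new KV-cache blocks at once.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # placeholder; the report above only says "qwen7B"
    max_model_len=1024,
    max_num_batched_tokens=1024,      # reduced from the default of 8192
    # gpu_memory_utilization is left at its default, as in the report
)

# Yes/no classification: a single completion token per request.
params = SamplingParams(max_tokens=1, temperature=0.0)
outputs = llm.generate(["Answer yes or no: is the word 'dog' related to the image?"], params)
print(outputs[0].outputs[0].text)
```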
Thanks for the details and the log! I agree this is related to #191. I think it is likely a race condition: LLM engine A checks available GPU memory first but allocates its KV cache blocks later, and some other engine B may allocate memory during that window, so A cannot get the memory it planned for. I'll need some time to investigate this, but I'll keep you updated.
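To make the suspected interleaving concrete, here is a toy sketch of the check-then-allocate race (hypothetical names and numbers, not kvcached's actual code):

```python
import threading
import time

# One shared GPU block pool, grossly simplified.
free_blocks = 100
lock = threading.Lock()

def engine(name: str, planned: int, delay: float) -> None:
    """Check free memory, then allocate later -- the gap is the race window."""
    global free_blocks
    observed = free_blocks          # step 1: engine sizes its KV cache from this reading
    time.sleep(delay)               # window between "check" and "allocate"
    with lock:                      # step 2: allocate based on the stale observation
        if free_blocks < planned:
            print(f"{name}: Cannot get {planned} free blocks from the pool "
                  f"(saw {observed} at check time, only {free_blocks} left now)")
            return
        free_blocks -= planned
        print(f"{name}: allocated {planned} blocks, {free_blocks} remaining")

# Engine A plans to take 80 blocks but is slow to allocate; engine B grabs 60 first.
a = threading.Thread(target=engine, args=("engine A", 80, 0.1))
b = threading.Thread(target=engine, args=("engine B", 60, 0.0))
a.start(); b.start(); a.join(); b.join()
```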
Hi @alecngo, thanks for the issue! Could you provide more configuration details from when the bug occurs, besides max_num_batched_tokens = 8k? For example, the dataset length (is it still 200k as in issue #191?), the prompt and completion lengths, and so on. That would be very helpful for diagnosing the bug.
The dataset length is still 200k. The data is under NDA, so I'll just summarize: the prompt is around 300 English words asking Qwen to answer yes or no on whether a word is related to the image, so the completion length should be just 1 token.
Interesting. Thanks for the info. I will try to reproduce it first.
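For the repro, I am planning something along these lines (ports, model name, and prompt wording are all assumptions on my side):

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Three vLLM OpenAI-compatible servers sharing one A100 via kvcached (assumed ports).
ENGINES = ["http://localhost:8001/v1", "http://localhost:8002/v1", "http://localhost:8003/v1"]
clients = [OpenAI(base_url=url, api_key="EMPTY") for url in ENGINES]

# ~300-word yes/no prompt as a stand-in for the NDA'd one.
prompt = ("Answer yes or no: is the following word related to the image described here? "
          + "lorem ipsum " * 150)

def ask(i: int) -> str:
    client = clients[i % len(clients)]
    resp = client.completions.create(
        model="qwen7b",     # whatever model name each server was launched with
        prompt=prompt,
        max_tokens=1,       # completions are a single yes/no token
    )
    return resp.choices[0].text

# Push 200k requests across all three engines concurrently to stress the shared pool.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(ask, range(200_000)))
```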