Running Llama 3.1 + Llama 3.3 70Bs on 8x A100s
Description
I'm running into an OOM/ValueError when attempting to launch vLLM servers with KVCached enabled for unsloth/Llama-3.1-70B-Instruct and unsloth/Llama-3.3-70B-Instruct (one server per model, sharing the same GPUs).
Startup fails, with the engine reporting that the available KV cache memory is insufficient to handle even a single request at the model's maximum sequence length (32k). This happens despite the large amount of total VRAM available (8x 40GB A100s).
I'm trying to understand whether I am misunderstanding a fundamental concept about how KVCached interacts with vLLM's memory management, specifically regarding gpu_memory_utilization and whether it needs to be set explicitly for each model, if at all.
Environment
Model: unsloth/Llama-3.1-70B-Instruct / unsloth/Llama-3.3-70B-Instruct
Hardware: 8x A100 GPUs (~40GB VRAM each)
vllm: 0.11.0
Steps to Reproduce
Set the following environment variables, then launch the two vLLM servers:
export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1
export VLLM_USE_V1=1
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
vllm serve unsloth/Llama-3.1-70B-Instruct \
--disable-log-requests \
--no-enable-prefix-caching \
--port 8001 \
--tensor-parallel-size 8 \
--enable-sleep-mode &
vllm serve unsloth/Llama-3.3-70B-Instruct \
--disable-log-requests \
--no-enable-prefix-caching \
--port 8002 \
--tensor-parallel-size 8 \
--enable-sleep-mode &
Logs and Error Messages
First, I receive a KVCached warning about free memory being less than the desired utilization (which I assume defaults to 0.9):
[kvcached][WARNING][2025-10-30 16:31:46][patches.py:749] Ignoring GPU free-memory check: Free memory on device (20.08/39.49 GiB) on startup is less than desired GPU memory utilization (0.9, 35.55 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
This warning is interesting because it shows only ~20GiB is free at startup, while vLLM seems to expect ~35.5GiB (90% of 39.49 GiB).
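To make sure I'm reading the warning correctly, here is the comparison I believe it is making (a rough sketch of my understanding, not kvcached's actual code):

# Numbers taken from the warning line above.
free_gib = 20.08                 # free memory reported on the device at startup
total_gib = 39.49                # total VRAM per A100, as reported
gpu_memory_utilization = 0.9     # assumed default

desired_gib = gpu_memory_utilization * total_gib                  # ~35.5 GiB
print(f"{desired_gib:.2f} GiB, warn={free_gib < desired_gib}")    # 35.54 GiB, warn=True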
Shortly after, the workers report very low available KV cache memory:
(Worker_TP4 pid=8229) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.21 GiB
(Worker_TP2 pid=8227) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.23 GiB
(Worker_TP6 pid=8231) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.09 GiB
(Worker_TP3 pid=8228) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.27 GiB
(Worker_TP1 pid=8226) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.16 GiB
(Worker_TP0 pid=8225) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.17 GiB
(Worker_TP7 pid=8232) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.09 GiB
(Worker_TP5 pid=8230) INFO 10-30 16:33:15 [gpu_worker.py:298] Available KV cache memory: 1.27 GiB
Finally, the engine fails to start with a ValueError:
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/kvcached/integration/vllm/patches.py", line 167, in _patched_engine_init
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] return original_init(self, vllm_config, *args, **kwargs)
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 199, in _initialize_kv_caches
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] kv_cache_configs = get_kv_cache_configs(vllm_config, kv_cache_specs,
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1243, in get_kv_cache_configs
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] check_enough_kv_cache_memory(vllm_config, kv_cache_spec_one_worker,
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] File "/opt/python/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 716, in check_enough_kv_cache_memory
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] raise ValueError(
(EngineCore_DP0 pid=7814) ERROR 10-30 16:33:16 [core.py:708] ValueError: To serve at least one request with the models's max seq len (32000), (1.22 GiB KV cache is needed, which is larger than the available KV cache memory (1.17 GiB). Based on the available memory, the estimated maximum model length is 30544. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
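For reference, the 1.22 GiB figure is consistent with a back-of-the-envelope per-GPU KV cache calculation, assuming Llama-3.1-70B's published config (80 layers, 8 KV heads, head_dim 128) and a 16-bit KV cache:

# Per-token, per-GPU KV cache size for Llama-3.1-70B with --tensor-parallel-size 8.
layers = 80          # num_hidden_layers
kv_heads = 8         # num_key_value_heads (GQA)
head_dim = 128       # hidden_size / num_attention_heads
dtype_bytes = 2      # fp16/bf16
tp = 8

# K and V are stored for every layer; the 8 KV heads are split across the 8 TP ranks.
bytes_per_token = 2 * layers * (kv_heads // tp) * head_dim * dtype_bytes   # 40 KiB

needed_gib = 32_000 * bytes_per_token / 2**30
print(f"{needed_gib:.2f} GiB")   # 1.22 GiB, matching the error above

So the per-request requirement itself looks normal; the problem is that only ~1.1-1.3 GiB of headroom is left for the KV cache on each worker.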
Question
The core of my confusion is: Do I need to explicitly set gpu_memory_utilization?
Should I be manually setting --gpu-memory-utilization to a lower value (e.g., 0.5, based on 20.08 / 39.49) to reflect the actual available memory?
Any clarification on how KVCached and gpu_memory_utilization are intended to interact would be greatly appreciated.
Hi @Nujjy, Thanks for using kvcached!
I did a simple calculation, and this behavior seems expected. Both models are 70B, so with FP16 weights each of them needs roughly 70*2/8 = 17.5 GB per GPU; for two models that is about 35 GB per GPU. Note that this is a very rough estimation, so the actual number could be larger. On top of that, the system needs some memory for activations, which can push each GPU close to running out of memory. So if the system reports OOM errors in this setup, that is expected to me. Feel free to correct me if my calculation is wrong.
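A minimal sketch of that arithmetic (assuming FP16 weights at 2 bytes per parameter, sharded evenly across the 8 tensor-parallel ranks):

# Rough per-GPU weight footprint for one 70B model, FP16, TP=8.
params_billions = 70
bytes_per_param = 2        # FP16
tp_ranks = 8

one_model_gb = params_billions * bytes_per_param / tp_ranks   # 17.5 GB per GPU
two_models_gb = 2 * one_model_gb                              # 35.0 GB per GPU
print(one_model_gb, two_models_gb)                            # 17.5 35.0, out of ~40 GB per A100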
When using kvcached, you can ignore gpu_memory_utilization. No need to set its value explicitly.