[Bug]: VLLM 0.8.2 OOM error (No error in 0.7.3 version)
Your current environment
Platform: Databricks
vLLM version: 0.8.2
🐛 Describe the bug
I have been using vLLM for over 6 months with no problems, until recently when I moved to version 0.8.2.
I am installing vLLM using:
pip install --upgrade vllm
Then, for any model I try to load, I immediately get an OOM with the following error:
The Python process exited with exit code 137 (SIGKILL). This may have been caused by an OOM error.
The model loads correctly with vllm==0.7.3. I have tried increasing and decreasing gpu_memory_utilization (0.9, 0.5, and 0.96) and changing max_num_seqs to 256, still no luck. I have even tried setting the environment variable via export VLLM_USE_V1=1 and =0 (for the V0 engine), still no luck. What causes this problem in 0.8.2 that did not exist in 0.7.3, and is there a solution for it?
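For reference, this is roughly how those settings go together through the Python API; a minimal sketch of the configuration described above (the model path is a placeholder), not a known fix:

```python
from vllm import LLM

# Settings mirror the ones tried above; none of them resolved the OOM.
llm = LLM(
    model="<path-or-hf-id-of-your-model>",  # placeholder
    max_model_len=20000,
    gpu_memory_utilization=0.5,   # also tried 0.9 and 0.96
    max_num_seqs=256,
    trust_remote_code=True,
)
```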
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Can you share your model/GPU/startup command and logs?
Yes, sure. The model is a GPTQ-quantized version of nvidia/Llama-3_3-Nemotron-Super-49B-v1. Command:
from vllm import LLM, SamplingParams
llm = LLM(model = model, max_model_len = 20000, trust_remote_code = True)
This gives an OOM error after 10 seconds.
For the GPTQ-quantized version of Llama 70B, the same command as above:
llm = LLM(model = model, max_model_len = 20000)
either gives an OOM or takes more than 30 minutes to load, while version 0.7.3 loads it in 9 minutes.
@robertgshaw2-redhat, I am confused: do the 0.8.2 or 0.8.1 versions require FlashInfer to be installed for efficient performance? This is very strange; either I get an OOM, or after 10 minutes the GPU memory usage is still 0. Version 0.7.3 can load the whole thing in under 10 minutes.
~~To solve the OOM problem, I recommend reducing max_num_seqs as the default has increased from 256 in V0 to 1024 in V1.~~ Never mind, I see you have already done that, this is probably a different issue then.
same issue 👀
+1
+1
+1 with deepseek-r1
> ~~To solve the OOM problem, I recommend reducing max_num_seqs as the default has increased from 256 in V0 to 1024 in V1.~~ Never mind, I see you have already done that, this is probably a different issue then.
I ran into this too. Setting max_num_seqs to the same value as v0 didn’t work for me, but lowering it (e.g., 64) fixed it. It might be worth a shot.
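If it helps, here is the same thing through the Python API; a sketch only, where the model path is a placeholder and 64 is just the value that happened to work for me (the serve equivalent would be --max-num-seqs 64):

```python
from vllm import LLM

# Lowering max_num_seqs well below the V1 default reduced peak memory enough in my case.
llm = LLM(
    model="<path-or-hf-id-of-your-model>",  # placeholder
    max_num_seqs=64,
)
```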
+1
+1 with qwen2.5-vl-awq on 0.8.2; the same parameters work fine on 0.8.1.
I have the same question.
I have the same issue: with V0 I can serve mistral3.1-awq with a 4k context length on a 24 GB GPU, but I get an OOM if I use V1. Check here.
+1
I have 2x L40s
cannot reproduce with Meta-Llama-3.1-8B-Instruct-quantized.w8a16 https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16
invoked like this:
VLLM_USE_V1=1 vllm serve Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --host 0.0.0.0 --served-model-name llama3.1-8B llama3.1-8B-Int8 --port 8000 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser llama3_json
Mistral-3.1-24B https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503 also works fine for me with vllm 0.8.2, invoked like this:
a)
VLLM_USE_V1=0 vllm serve mistralai/Mistral-Small-3.1-24B-Base-2503 --tensor-parallel-size 1 \
--max-model-len 32000 --gpu-memory-utilization 0.90 --distributed-executor-backend mp \
--served-model-name mistral --tokenizer-mode mistral --config-format mistral --load_format mistral
b)
VLLM_USE_V1=1 vllm serve mistralai/Mistral-Small-3.1-24B-Base-2503 --tensor-parallel-size 1 \
--max-model-len 32000 --gpu-memory-utilization 0.90 --distributed-executor-backend mp \
--served-model-name mistral --tokenizer-mode mistral --config-format mistral --load_format mistral
Hi @manitadayon , I am downloading https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 now.
How did you quantize it? Using huggingface + autogptq? How many bits?
Thank you and best regards
@paolovic 4-bit quantization with AutoGPTQ and HF. The models you have tried are very small, that's one point. The other point is not that 0.8.2 does not work at all; it may work, and I actually got it to work for Llama 70B. The problem is that it handles memory very inefficiently.
None of these problems even exist in 0.7.3.
Now 0.8.3 is even worse: I cannot get anything to work, and I always get the same error, no matter the configuration.
The error in 0.8.3 mainly shows up as an EOFError, which is essentially an OOM.
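For context, here is roughly what a 4-bit AutoGPTQ + HF quantization like the one described above looks like. This is only a sketch: the group_size, the single calibration sample, and the output directory name are illustrative assumptions, not the exact recipe used, and I have not verified that AutoGPTQ handles this custom architecture out of the box.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
out_dir = "Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ"   # illustrative output path

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# 4-bit, as stated above; group_size=128 is a common default, not the exact recipe used.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# A real run would use a proper calibration set; this single sample is a placeholder.
examples = [
    tokenizer("vLLM is a fast and easy-to-use library for LLM inference.",
              return_tensors="pt")
]
model.quantize(examples)

model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```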
Hi everyone,
I managed to solve the OOM issue for most models, besides the Nemotron reasoning one, by passing the
enforce_eager=True parameter.
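Concretely, the workaround looks like this (a sketch; the model path is a placeholder):

```python
from vllm import LLM

# Skipping CUDA graph capture avoided the memory spike that was causing the OOM for me.
llm = LLM(
    model="<path-or-hf-id-of-your-model>",  # placeholder
    max_model_len=20000,
    enforce_eager=True,
)
```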
@manitadayon Mistral in half precision is larger than nvidia/Llama-3_3-Nemotron-Super-49B-v1 in 4-bit.
Anyway, since I was suspecting a memory leak, I was hoping they would lead to an OOM error in any case.
Alright, thank you very much, I will try to reproduce the issue.
Hi @manitadayon ,
is it possible that you experienced your OOM error while computing the CUDA graph?
Because enforce_eager=True is a way to circumvent this particular OOM during CUDA graph computation.
It may be the case; it's just that I have played with parameters such as gpu_memory_utilization and max_num_seqs and reduced them to very low numbers, but the error still persists.
Alright, I'm quantizing nvidia/Llama-3_3-Nemotron-Super-49B-v1 to 4-bit GPTQ right now.
Same problem with neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16, and that with 96 GB of memory already.
model: "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16"
tensor-parallel-size: 2
max-model-len: 32768
max-num-seqs: 4
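For reference, the same configuration expressed through the offline Python API would look roughly like this (a sketch; I am assuming a direct mapping from those config keys to the LLM constructor arguments):

```python
from vllm import LLM

llm = LLM(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16",
    tensor_parallel_size=2,
    max_model_len=32768,
    max_num_seqs=4,
)
```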
Hi @manitadayon, nice, I was able to reproduce the error.
Same machine, 2x Nvidia L40s, vllm 0.8.3
- V0 works as follows:
CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=0 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code -q gptq --max-model-len 32768
INFO 04-11 09:34:18 [model_runner.py:1598] Graph capturing finished in 29 secs, took 3.34 GiB
INFO 04-11 09:34:18 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 46.05 seconds
INFO 04-11 09:34:18 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-11 09:34:18 [launcher.py:26] Available routes are:
INFO 04-11 09:34:18 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /health, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /load, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /version, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /score, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /metrics, Methods: GET
INFO: Started server process [2096773]
INFO: Waiting for application startup.
INFO: Application startup complete.
- V1 fails:
CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code -q gptq --max-model-len 32768
INFO 04-11 09:30:46 [loader.py:447] Loading weights took 3.71 seconds
INFO 04-11 09:30:47 [gpu_model_runner.py:1273] Model loading took 27.0754 GiB and 4.111081 seconds
INFO 04-11 09:30:59 [backends.py:416] Using cache directory: /.cache/vllm/torch_compile_cache/1a15678859/rank_0_0 for vLLM's torch.compile
INFO 04-11 09:30:59 [backends.py:426] Dynamo bytecode transform time: 12.46 s
INFO 04-11 09:31:02 [backends.py:132] Cache the graph of shape None for later use
INFO 04-11 09:31:47 [backends.py:144] Compiling a graph for general shape takes 47.02 s
INFO 04-11 09:32:07 [monitor.py:33] torch.compile takes 59.47 s in total
ERROR 04-11 09:32:08 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-11 09:32:08 [core.py:390] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-11 09:32:08 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-11 09:32:08 [core.py:390] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-11 09:32:08 [core.py:390] self._initialize_kv_caches(vllm_config)
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 136, in _initialize_kv_caches
ERROR 04-11 09:32:08 [core.py:390] kv_cache_configs = [
ERROR 04-11 09:32:08 [core.py:390] ^
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 137, in <listcomp>
ERROR 04-11 09:32:08 [core.py:390] get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 643, in get_kv_cache_config
ERROR 04-11 09:32:08 [core.py:390] check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 04-11 09:32:08 [core.py:390] File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 490, in check_enough_kv_cache_memory
ERROR 04-11 09:32:08 [core.py:390] raise ValueError(
ERROR 04-11 09:32:08 [core.py:390] ValueError: To serve at least one request with the models's max seq len (32768), (6.12 GiB KV cache is needed, which is larger than the available KV cache memory (5.63 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 04-11 09:32:08 [core.py:390]
CRITICAL 04-11 09:32:08 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
Now, I can investigate.
@DarkLight1337 FYI
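In the meantime, for anyone hitting the same ValueError, here is a workaround sketch that simply applies the error message's own suggestion; the 16384 and 0.95 values are illustrative, not something I have verified:

```python
from vllm import LLM

# Trade context length for KV-cache headroom, per the error message's hint.
llm = LLM(
    model="Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/",  # local GPTQ checkpoint from above
    quantization="gptq",
    trust_remote_code=True,
    max_model_len=16384,           # down from 32768
    gpu_memory_utilization=0.95,   # up from the 0.90 default
)
```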
Thank you. Oh, you are able to run the model on 1 GPU in V0 version (only 48GB memory)? (since you set the CUDA visible device to only 0). But I see you are passing the quantization parameter, so it is not using the Marlin kernel that detects the quantization by default?
without -q gptq, V0 works as well (I had to reduce the max_model_len like previously though)
CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=0 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code --max-model-len 32768
and V1 fails again
CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code --max-model-len 32768
INFO 04-11 09:52:55 [backends.py:426] Dynamo bytecode transform time: 12.96 s
INFO 04-11 09:52:58 [backends.py:132] Cache the graph of shape None for later use
INFO 04-11 09:53:45 [backends.py:144] Compiling a graph for general shape takes 49.14 s
INFO 04-11 09:54:08 [monitor.py:33] torch.compile takes 62.10 s in total
ERROR 04-11 09:54:09 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-11 09:54:09 [core.py:390] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-11 09:54:09 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-11 09:54:09 [core.py:390] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-11 09:54:09 [core.py:390] self._initialize_kv_caches(vllm_config)
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 136, in _initialize_kv_caches
ERROR 04-11 09:54:09 [core.py:390] kv_cache_configs = [
ERROR 04-11 09:54:09 [core.py:390] ^
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 137, in <listcomp>
ERROR 04-11 09:54:09 [core.py:390] get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 643, in get_kv_cache_config
ERROR 04-11 09:54:09 [core.py:390] check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 04-11 09:54:09 [core.py:390] File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 490, in check_enough_kv_cache_memory
ERROR 04-11 09:54:09 [core.py:390] raise ValueError(
ERROR 04-11 09:54:09 [core.py:390] ValueError: To serve at least one request with the models's max seq len (32768), (6.12 GiB KV cache is needed, which is larger than the available KV cache memory (5.68 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 04-11 09:54:09 [core.py:390]
CRITICAL 04-11 09:54:09 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
Just a note, in case this helps to repro the issue:
I had the same issue with V0 as well, on vllm 0.8.3.
I tried running the MiniMax model on 8x H100. No luck.
It failed even with a sequence length of 1.
Furthermore, if I run it with vllm 0.7.1, it runs just fine.
Thanks
Thanks @paolovic. I can confirm that setting VLLM_USE_V1=0 will fix the problem.
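For the Python API, that workaround looks like this (a sketch; the model path is a placeholder for whatever model you were already loading):

```python
import os

# Must be set before the vLLM engine is created, so it selects the V0 engine.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

model_name = "<path-or-hf-id-of-your-model>"  # placeholder
llm = LLM(model=model_name)
```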
@paolovic @hmellor @DarkLight1337 could you please check this ticket related to vllm version 0.8.3? https://github.com/vllm-project/vllm/issues/16552
+1. I am using vLLM like llm = LLM(model=model_name). I am currently installing vllm 0.7.3, which is messing with all my other packages due to its dependencies, so I cannot wait for the latest version of vLLM to work properly!