
[Bug]: VLLM 0.8.2 OOM error (No error in 0.7.3 version)

Open manitadayon opened this issue 7 months ago • 28 comments

Your current environment

Platform: Databricks; vLLM version: 0.8.2

🐛 Describe the bug

I have been using vLLM for over 6 months with no problem, until recently when I moved to vllm 0.8.2. I install vllm using pip install --upgrade vllm, and then any model I try to load immediately fails with OOM and the following error: The Python process exited with exit code 137 (SIGKILL). This may have been caused by an OOM error.

The model loads correctly with vllm==0.7.3. I have tried increasing and decreasing gpu_memory_utilization (0.9, 0.5, and 0.96) and changing max_num_seqs to 256, still no luck. I have even tried switching engines via export VLLM_USE_V1=1 (and =0 for the V0 engine), still no luck. What causes this problem in 0.8.2 that did not exist in 0.7.3, and is there a solution for it?
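For reference, this is roughly how I create the engine and the settings I have been varying (a sketch; the model path is a placeholder):

import os

# Switch between the V1 and V0 engines (I tried both values)
os.environ["VLLM_USE_V1"] = "1"  # also tried "0"

from vllm import LLM

llm = LLM(
    model="path/to/gptq-model",   # placeholder path
    max_model_len=20000,
    gpu_memory_utilization=0.9,   # also tried 0.5 and 0.96
    max_num_seqs=256,
    trust_remote_code=True,
)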

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

manitadayon avatar Mar 28 '25 01:03 manitadayon

can you share your model/gpu/startup command and logs?

robertgshaw2-redhat avatar Mar 28 '25 02:03 robertgshaw2-redhat

Yes, sure. The model is a GPTQ-quantized version of nvidia/Llama-3_3-Nemotron-Super-49B-v1. Command:

from vllm import LLM, SamplingParams
llm = LLM(model=model, max_model_len=20000, trust_remote_code=True)

gives an OOM error after about 10 seconds.

For the GPTQ-quantized version of Llama 70B, the same command, llm = LLM(model=model, max_model_len=20000), either gives OOM or takes more than 30 minutes to load, while version 0.7.3 loads it in 9 minutes.

manitadayon avatar Mar 28 '25 02:03 manitadayon

@robertgshaw2-redhat, I am confused: do the 0.8.1 or 0.8.2 versions require FlashInfer to be installed for efficient performance? This is very strange; I either get OOM, or after 10 minutes the GPU memory usage is still 0, while 0.7.3 can load the whole thing in less than 10 minutes.

manitadayon avatar Mar 28 '25 02:03 manitadayon

~~To solve the OOM problem, I recommend reducing max_num_seqs as the default has increased from 256 in V0 to 1024 in V1.~~ Never mind, I see you have already done that, this is probably a different issue then.

DarkLight1337 avatar Mar 28 '25 05:03 DarkLight1337

same issue 👀

spitzblattr avatar Mar 28 '25 08:03 spitzblattr

+1

twright8 avatar Mar 28 '25 12:03 twright8

+1

DaBossCoda avatar Mar 28 '25 16:03 DaBossCoda

+1 with deepseek-r1

Hugh-yw avatar Mar 29 '25 10:03 Hugh-yw

~~To solve the OOM problem, I recommend reducing max_num_seqs as the default has increased from 256 in V0 to 1024 in V1.~~ Never mind, I see you have already done that, this is probably a different issue then.

I ran into this too. Setting max_num_seqs to the same value as v0 didn’t work for me, but lowering it (e.g., 64) fixed it. It might be worth a shot.
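Something like this is what worked for me (a sketch; the model path is a placeholder):

from vllm import LLM

llm = LLM(
    model="path/to/model",  # placeholder
    max_num_seqs=64,        # well below the V1 default of 1024
)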

gaocegege avatar Mar 30 '25 02:03 gaocegege

+1

xvshengjie avatar Mar 30 '25 11:03 xvshengjie

+1 with qwen2.5-vl-awq on 0.8.2; the same parameters are OK on 0.8.1.

DayDayupupupup avatar Mar 31 '25 02:03 DayDayupupupup

I have the same question.

codeido avatar Apr 02 '25 06:04 codeido

I have the same issue: with V0 I can serve mistral3.1-awq with a 4k context length on a 24 GB GPU, but I get OOM if I use V1. Check here.

hahmad2008 avatar Apr 07 '25 10:04 hahmad2008

+1

rakshithvasudev avatar Apr 07 '25 17:04 rakshithvasudev

I have 2x L40s and cannot reproduce with Meta-Llama-3.1-8B-Instruct-quantized.w8a16 (https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16), invoked like this:

VLLM_USE_V1=1 vllm serve Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --host 0.0.0.0 --served-model-name llama3.1-8B llama3.1-8B-Int8 --port 8000 --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser llama3_json

Mistral-Small-3.1-24B (https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) also works fine for me with vllm 0.8.2, invoked like this:

a)

 VLLM_USE_V1=0 vllm serve mistralai/Mistral-Small-3.1-24B-Base-2503  --tensor-parallel-size 1  \
--max-model-len 32000 --gpu-memory-utilization 0.90   --distributed-executor-backend mp  \
--served-model-name mistral --tokenizer-mode mistral --config-format mistral --load_format mistral

b)

 VLLM_USE_V1=1 vllm serve mistralai/Mistral-Small-3.1-24B-Base-2503  --tensor-parallel-size 1  \
--max-model-len 32000 --gpu-memory-utilization 0.90   --distributed-executor-backend mp  \
--served-model-name mistral --tokenizer-mode mistral --config-format mistral --load_format mistral

paolovic avatar Apr 08 '25 22:04 paolovic

Hi @manitadayon , I am downloading https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 now.

How did you quantize it? Using huggingface + autogptq? How many bits?

Thank you and best regards

paolovic avatar Apr 08 '25 22:04 paolovic

@paolovic 4-bit quantization with AutoGPTQ and HF. The models you have tried are very small; that's one point. My point is not that 0.8.2 does not work at all: it may work, and I actually got it to work for Llama 70B, but it handles memory very inefficiently. None of these problems exist in 0.7.3. Now 0.8.3 is even worse: I cannot get anything to work and always hit the same error, no matter the configuration.
The error in 0.8.3 mainly shows up as an EOFError, which is essentially OOM.

manitadayon avatar Apr 08 '25 22:04 manitadayon

Hi everyone, I managed to solve the OOM issue for most of the models (besides the Nemotron reasoning one) by passing the enforce_eager=True parameter.
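In case it helps, this is roughly the change (a sketch; the model path is a placeholder):

from vllm import LLM

llm = LLM(
    model="path/to/gptq-model",  # placeholder
    max_model_len=20000,
    trust_remote_code=True,
    enforce_eager=True,          # skip CUDA graph capture, which saves some GPU memory
)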

manitadayon avatar Apr 09 '25 06:04 manitadayon

@manitadayon Mistral in half precision is larger than nvidia/Llama-3_3-Nemotron-Super-49B-v1 in 4-bit. Anyway, since I suspected a memory leak, I was hoping they would trigger an OOM error in any case.

Alright, thank you very much, I will try to reproduce the issue.

paolovic avatar Apr 09 '25 07:04 paolovic

Hi @manitadayon, is it possible that you experienced the OOM error while capturing the CUDA graph? enforce_eager=True is a way to circumvent that particular OOM during CUDA graph capture.

paolovic avatar Apr 09 '25 10:04 paolovic

It may be the case; it's just that I have played with parameters such as gpu_memory_utilization and max_num_seqs and reduced them to very low values, but the error still persists.

manitadayon avatar Apr 09 '25 21:04 manitadayon

Alright, I'm quantizing nvidia/Llama-3_3-Nemotron-Super-49B-v1 to 4-bit GPTQ right now.

paolovic avatar Apr 10 '25 14:04 paolovic

Same problem with neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16, even with 96 GB of GPU memory.

  model: "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16"
  tensor-parallel-size: 2
  max-model-len: 32768
  max-num-seqs: 4
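For reference, the same settings expressed via the offline Python API would be roughly (a sketch):

from vllm import LLM

llm = LLM(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16",
    tensor_parallel_size=2,
    max_model_len=32768,
    max_num_seqs=4,
)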

cpwan avatar Apr 11 '25 05:04 cpwan

Hi @manitadayon, nice, I was able to reproduce the error.

Same machine, 2x Nvidia L40s, vllm 0.8.3

  1. V0 works as follows:
CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=0 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code -q gptq --max-model-len 32768

INFO 04-11 09:34:18 [model_runner.py:1598] Graph capturing finished in 29 secs, took 3.34 GiB
INFO 04-11 09:34:18 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 46.05 seconds
INFO 04-11 09:34:18 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-11 09:34:18 [launcher.py:26] Available routes are:
INFO 04-11 09:34:18 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /health, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /load, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /version, Methods: GET
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /score, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-11 09:34:18 [launcher.py:34] Route: /metrics, Methods: GET
INFO:     Started server process [2096773]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
  2. V1 fails:
CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code -q gptq --max-model-len 32768

INFO 04-11 09:30:46 [loader.py:447] Loading weights took 3.71 seconds
INFO 04-11 09:30:47 [gpu_model_runner.py:1273] Model loading took 27.0754 GiB and 4.111081 seconds
INFO 04-11 09:30:59 [backends.py:416] Using cache directory: /.cache/vllm/torch_compile_cache/1a15678859/rank_0_0 for vLLM's torch.compile
INFO 04-11 09:30:59 [backends.py:426] Dynamo bytecode transform time: 12.46 s
INFO 04-11 09:31:02 [backends.py:132] Cache the graph of shape None for later use
INFO 04-11 09:31:47 [backends.py:144] Compiling a graph for general shape takes 47.02 s
INFO 04-11 09:32:07 [monitor.py:33] torch.compile takes 59.47 s in total
ERROR 04-11 09:32:08 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-11 09:32:08 [core.py:390]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-11 09:32:08 [core.py:390]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-11 09:32:08 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-11 09:32:08 [core.py:390]     self._initialize_kv_caches(vllm_config)
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 136, in _initialize_kv_caches
ERROR 04-11 09:32:08 [core.py:390]     kv_cache_configs = [
ERROR 04-11 09:32:08 [core.py:390]                        ^
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 137, in <listcomp>
ERROR 04-11 09:32:08 [core.py:390]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 643, in get_kv_cache_config
ERROR 04-11 09:32:08 [core.py:390]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 04-11 09:32:08 [core.py:390]   File "/environments/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 490, in check_enough_kv_cache_memory
ERROR 04-11 09:32:08 [core.py:390]     raise ValueError(
ERROR 04-11 09:32:08 [core.py:390] ValueError: To serve at least one request with the models's max seq len (32768), (6.12 GiB KV cache is needed, which is larger than the available KV cache memory (5.63 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 04-11 09:32:08 [core.py:390]
CRITICAL 04-11 09:32:08 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

Now, I can investigate.

@DarkLight1337 FYI

paolovic avatar Apr 11 '25 07:04 paolovic

Thank you. Oh, you are able to run the model on 1 GPU with V0 (only 48 GB of memory)? (Since you set CUDA_VISIBLE_DEVICES to 0 only.) But I see you are passing the quantization parameter, so it is not using the Marlin kernel that detects the quantization by default?

manitadayon avatar Apr 11 '25 07:04 manitadayon

Thank you. Oh, you are able to run the model on 1 GPU with V0 (only 48 GB of memory)? (Since you set CUDA_VISIBLE_DEVICES to 0 only.) But I see you are passing the quantization parameter, so it is not using the Marlin kernel that detects the quantization by default?

Without -q gptq, V0 works as well (though I had to reduce max_model_len, as before):

CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=0 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code --max-model-len 32768

and V1 fails again

CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code --max-model-len 32768

INFO 04-11 09:52:55 [backends.py:426] Dynamo bytecode transform time: 12.96 s
INFO 04-11 09:52:58 [backends.py:132] Cache the graph of shape None for later use
INFO 04-11 09:53:45 [backends.py:144] Compiling a graph for general shape takes 49.14 s
INFO 04-11 09:54:08 [monitor.py:33] torch.compile takes 62.10 s in total
ERROR 04-11 09:54:09 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-11 09:54:09 [core.py:390]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-11 09:54:09 [core.py:390]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-11 09:54:09 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-11 09:54:09 [core.py:390]     self._initialize_kv_caches(vllm_config)
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 136, in _initialize_kv_caches
ERROR 04-11 09:54:09 [core.py:390]     kv_cache_configs = [
ERROR 04-11 09:54:09 [core.py:390]                        ^
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 137, in <listcomp>
ERROR 04-11 09:54:09 [core.py:390]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 643, in get_kv_cache_config
ERROR 04-11 09:54:09 [core.py:390]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 04-11 09:54:09 [core.py:390]   File "/quantization_env/lib64/python3.11/site-packages/vllm/v1/core/kv_cache_utils.py", line 490, in check_enough_kv_cache_memory
ERROR 04-11 09:54:09 [core.py:390]     raise ValueError(
ERROR 04-11 09:54:09 [core.py:390] ValueError: To serve at least one request with the models's max seq len (32768), (6.12 GiB KV cache is needed, which is larger than the available KV cache memory (5.68 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 04-11 09:54:09 [core.py:390]
CRITICAL 04-11 09:54:09 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

paolovic avatar Apr 11 '25 07:04 paolovic

Just a note, in case it helps to repro the issue:

I had the same issue with V0 as well, on vllm 0.8.3.

I tried running the MiniMax model on 8x H100. No luck.

It failed even with a sequence length of 1.

Furthermore, if I run it with vllm 0.7.1, it runs just fine.

Thanks

rakshithvasudev avatar Apr 11 '25 15:04 rakshithvasudev

> Hi @manitadayon, nice, I was able to reproduce the error. [...] Now, I can investigate. @DarkLight1337 FYI
Thanks @paolovic. I can confirm that setting VLLM_USE_V1=0 will fix the problem.
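For anyone hitting this from the Python API, this is roughly how I force the V0 engine (a sketch; the model path is a placeholder):

import os
os.environ["VLLM_USE_V1"] = "0"  # set before constructing the engine, to be safe

from vllm import LLM

llm = LLM(model="path/to/gptq-model", max_model_len=20000, trust_remote_code=True)  # placeholder path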

manitadayon avatar Apr 11 '25 23:04 manitadayon

@paolovic @hmellor @DarkLight1337 could you please check this ticket related to vllm version 0.8.3? https://github.com/vllm-project/vllm/issues/16552

hahmad2008 avatar Apr 13 '25 13:04 hahmad2008

+1. I am using vLLM like llm = LLM(model=model_name). I am currently installing vllm 0.7.3, which is messing with all my other packages due to its dependencies, so I can't wait for the latest version of vLLM to work properly!

darkness8i8 avatar Apr 14 '25 22:04 darkness8i8