[V1] Feedback Thread

Open simon-mo opened this issue 10 months ago · 89 comments

Please leave comments here about your usage of V1: does it work? Does it not work? Which features do you need in order to adopt it? Any bugs?

For bug reports, please file them separately and link the issues here.

For in-depth discussion, please feel free to join #sig-v1 in the vLLM Slack workspace.

simon-mo avatar Jan 30 '25 02:01 simon-mo

  • https://github.com/vllm-project/vllm/issues/12567

robertgshaw2-redhat avatar Jan 30 '25 02:01 robertgshaw2-redhat

πŸ‘ I have not done a proper benchmark but V1 feels superior, i.e. higher throughput + lower latency, TTFT. The other thing that I have noticed is that logging has changed Running: 1 reqs, Waiting: 0 reqs, it used to print stats such token/s.

I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1. #12529

wedobetter avatar Jan 30 '25 07:01 wedobetter

Does anyone know about this bug with n>1? Thanks https://github.com/vllm-project/vllm/issues/12584

m-harmonic avatar Jan 30 '25 18:01 m-harmonic

Does anyone know about this bug with n>1? Thanks #12584

Thanks, we are aware and have some ongoing PRs for it.

https://github.com/vllm-project/vllm/pull/10980

robertgshaw2-redhat avatar Jan 30 '25 18:01 robertgshaw2-redhat

I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1.

Logging is in progress. Current main has a lot more and we will maintain compatibility with V0. Thanks!

robertgshaw2-redhat avatar Jan 30 '25 22:01 robertgshaw2-redhat

Quick feedback [VLLM_USE_V1=1]:

  • n > 1 would be nice

  • guided_grammar (or anything guided really) would be nice
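
For concreteness, this is roughly the request shape we'd like V1 to accept: a sketch against the OpenAI-compatible server, where the n parameter and the guided_grammar extra_body field follow the V0 behaviour and the grammar itself is purely illustrative.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# illustrative grammar: constrain the answer to "yes" or "no"
grammar = 'root ::= "yes" | "no"'

resp = client.chat.completions.create(
    model="model",                               # the served-model-name
    messages=[{"role": "user", "content": "Is the sky blue?"}],
    n=4,                                         # n > 1: several samples per prompt
    extra_body={"guided_grammar": grammar},      # guided decoding, V0-style extra field
)
for choice in resp.choices:
    print(choice.message.content)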

dchichkov avatar Jan 30 '25 22:01 dchichkov

Quick feedback [VLLM_USE_V1=1]:

  • n > 1 would be nice
  • guided_grammar (or anything guided really) would be nice

Thanks, both are in progress

robertgshaw2-redhat avatar Jan 31 '25 02:01 robertgshaw2-redhat

Are logprobs outputs (and specifically prompt logprobs with echo=True) expected to be working with the current V1 (0.7.0)? Checking here before opening an issue to reproduce.
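
For context, this is the kind of call I mean (a sketch against the OpenAI-compatible completions endpoint; with echo=True, V0 returns the prompt tokens together with their logprobs):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="model",                       # the served-model-name
    prompt="The capital of France is",
    max_tokens=5,
    logprobs=1,                          # per-token logprobs
    echo=True,                           # include the prompt (and its logprobs) in the response
)
print(resp.choices[0].logprobs.token_logprobs)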

hibukipanim avatar Jan 31 '25 14:01 hibukipanim

Maybe there is a better place to discuss this, but the implementation for models that use more than one extra modality is quite non-intuitive. get_multimodal_embeddings() expects us to return a list or tensor whose length equals the number of multimodal items provided in the batch, and we then have to make unintuitive assumptions about what the output passed into get_input_embeddings will look like, because the batching used when calling the two functions is not the same. It would be much nicer if, for example, the input and output of get_multimodal_embeddings were dicts keyed by the different modalities.
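
To make the suggestion concrete, here is a rough sketch of the interface shape I have in mind; the names and types are illustrative, not the current vLLM API:

from typing import Dict, List

import torch


class MultiModalModelSketch:
    def _encode(self, modality: str, item: torch.Tensor) -> torch.Tensor:
        # Placeholder encoder: a real model would dispatch to the vision/audio
        # tower for the given modality.
        return item.float().mean(dim=0, keepdim=True)

    def get_multimodal_embeddings(
        self, mm_inputs: Dict[str, List[torch.Tensor]]
    ) -> Dict[str, List[torch.Tensor]]:
        # e.g. {"image": [img0, img1], "audio": [clip0]} -> embeddings grouped the
        # same way, one tensor per multimodal item, keyed by modality.
        return {
            modality: [self._encode(modality, item) for item in items]
            for modality, items in mm_inputs.items()
        }

    def get_input_embeddings(
        self,
        text_embeds: torch.Tensor,
        mm_embeddings: Dict[str, List[torch.Tensor]],
    ) -> torch.Tensor:
        # With per-modality keys, merging no longer relies on guessing how a flat
        # list was batched across the two calls.
        merged = [text_embeds] + [e for embeds in mm_embeddings.values() for e in embeds]
        return torch.cat(merged, dim=0)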

akshay-loci avatar Jan 31 '25 15:01 akshay-loci

Are logprobs outputs (and specifically prompt logprobs with echo=True) expected to be working with the current V1 (0.7.0)? Checking here before opening an issue to reproduce.

Still in progress

robertgshaw2-redhat avatar Jan 31 '25 23:01 robertgshaw2-redhat

πŸ‘ I have not done a proper benchmark but V1 feels superior, i.e. higher throughput + lower latency, TTFT. The other thing that I have noticed is that logging has changed Running: 1 reqs, Waiting: 0 reqs, it used to print stats such token/s.

I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1. #12529

Thanks for fixing metrics logs in 0.7.1! Lack of pipeline parallelism in V1 is a show stopper for production deployments #11945

wedobetter avatar Feb 02 '25 14:02 wedobetter

Either I'm going insane, or with V1 the Qwen 8B instruct LLM just breaks in fp8: around 25% of generations are pure gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and do I need some specific setup of sampling params for it to work in V1?
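
Roughly the kind of setup where I see this (a sketch; the model name, prompts, and sampling values are illustrative rather than my exact script):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", quantization="fp8")   # illustrative checkpoint
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Write one sentence about the ocean."] * 20, params)
for out in outputs:
    print(out.outputs[0].text)  # under V1, roughly a quarter of these come back as gibberish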

Ouna-the-Dataweaver avatar Feb 03 '25 04:02 Ouna-the-Dataweaver

The V1 engine doesn't seem to support logits processors or min-p filtering. Issue #12678
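
For reference, this is the V0 behaviour I'd like carried over, sketched with an illustrative model and a toy logits processor:

from vllm import LLM, SamplingParams

def ban_token_42(token_ids, logits):
    # token_ids: ids generated so far for this sequence; logits: vocab-sized tensor.
    # Illustrative processor that simply masks out an arbitrary token id.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(
    temperature=1.0,
    min_p=0.05,                         # min-p filtering
    logits_processors=[ban_token_42],   # custom logits processor (V0-style callable)
)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)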

FrederickVu avatar Feb 03 '25 06:02 FrederickVu

Something is weird with memory calculation in V1 and tensor parallel. Here are 2 cases that I tested recently:

vllm 0.7.0 on 2x A6000:

Starting a 32b-awq model normally and using --max-model-len 32768 --gpu-memory-utilization 0.98 --tensor-parallel 2 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768:

Everything works as previously; both GPUs get to ~44-46GB usage.

Using VLLM_USE_V1=1 and the exact same parameters as above:

Both GPUs load up to ~24-25GB, and usage slowly goes up as inference runs. I've seen it go up to 32GB on each GPU.

Updating to vllm 0.7.1 and running a 7b-awq model this time, I also noticed that when running the above command "normally", the logs show Maximum concurrency at 44x.

Using V1 I get:

INFO 02-02 23:26:19 kv_cache_utils.py:400] Maximum concurrency for 32768 tokens per request: **22.25x**

And finally, with vllm 0.7.0 and 4x L4, loading a 32b-awq model with tp 4 works in "normal mode" but OOMs with V1.

gmonair avatar Feb 03 '25 15:02 gmonair

I did a little experiment with DeepSeek-R1 on 8xH200 GPU.

vLLM 0.7.0 showed the following results with benchmark_serving.py --backend openai --base-url http://0.0.0.0:8000 --dataset-name=random --model deepseek-ai/DeepSeek-R1

  • with VLLM_USE_V1=1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 1000/1000 [07:53<00:00, 2.11it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  473.62    
Total input tokens:                      1024000   
Total generated tokens:                  119550    
Request throughput (req/s):              2.11      
Output token throughput (tok/s):         252.42    
Total Token throughput (tok/s):          2414.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          100636.33 
Median TTFT (ms):                        103588.53 
P99 TTFT (ms):                           197277.97 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          177.82    
Median TPOT (ms):                        172.14    
P99 TPOT (ms):                           363.05    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.08    
Median ITL (ms):                         136.46    
P99 ITL (ms):                            575.30    
==================================================
  • without VLLM_USE_V1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 1000/1000 [05:24<00:00, 3.08it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  324.29    
Total input tokens:                      1024000   
Total generated tokens:                  119163    
Request throughput (req/s):              3.08      
Output token throughput (tok/s):         367.46    
Total Token throughput (tok/s):          3525.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          29022.37  
Median TTFT (ms):                        32492.50  
P99 TTFT (ms):                           54457.59  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          125.16    
Median TPOT (ms):                        119.91    
P99 TPOT (ms):                           411.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           120.20    
Median ITL (ms):                         76.78     
P99 ITL (ms):                            656.11    
==================================================

In general, vLLM without VLLM_USE_V1 looked more productive. I also tried V0 with --request-rate 10 and got

Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 1000/1000 [05:16<00:00, 3.16it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  316.20    
Total input tokens:                      1024000   
Total generated tokens:                  119448    
Request throughput (req/s):              3.16      
Output token throughput (tok/s):         377.76    
Total Token throughput (tok/s):          3616.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          100122.09 
Median TTFT (ms):                        98699.05  
P99 TTFT (ms):                           201732.11 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          139.61    
Median TPOT (ms):                        104.30    
P99 TPOT (ms):                           1276.91   
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.90    
Median ITL (ms):                         76.35     
P99 ITL (ms):                            648.36    
==================================================

Throughput was still 2x lower than SGLang in the same benchmark. Today I updated vLLM to the new version (0.7.1) and repeated the experiment, and the results with V0 have become much better!

  • without VLLM_USE_V1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 1000/1000 [04:29<00:00, 3.71it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  269.74    
Total input tokens:                      1024000   
Total generated tokens:                  119805    
Request throughput (req/s):              3.71      
Output token throughput (tok/s):         444.14    
Total Token throughput (tok/s):          4240.35   
---------------Time to First Token----------------
Mean TTFT (ms):                          368.78    
Median TTFT (ms):                        269.07    
P99 TTFT (ms):                           3826.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          124.95    
Median TPOT (ms):                        122.03    
P99 TPOT (ms):                           214.93    
---------------Inter-token Latency----------------
Mean ITL (ms):                           123.32    
Median ITL (ms):                         75.30     
P99 ITL (ms):                            583.77    
==================================================
  • without VLLM_USE_V1 (with --request-rate 10)
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 1000/1000 [02:26<00:00, 6.83it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  146.43    
Total input tokens:                      1024000   
Total generated tokens:                  119701    
Request throughput (req/s):              6.83      
Output token throughput (tok/s):         817.48    
Total Token throughput (tok/s):          7810.75   
---------------Time to First Token----------------
Mean TTFT (ms):                          14575.11  
Median TTFT (ms):                        13606.50  
P99 TTFT (ms):                           29954.96  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          297.01    
Median TPOT (ms):                        282.46    
P99 TPOT (ms):                           1393.69   
---------------Inter-token Latency----------------
Mean ITL (ms):                           262.67    
Median ITL (ms):                         132.89    
P99 ITL (ms):                            2840.40   
==================================================

But running vLLM with VLLM_USE_V1=1 I got an error, TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'q_lora_rank', preceded by warnings like

`torch.compile` is turned on, but the model deepseek-ai/DeepSeek-R1 does not support it. Please open an issue on GitHub if you want it to be supported.

Xarbirus avatar Feb 03 '25 16:02 Xarbirus

V1 does not support the T4; do you plan to support it?

bao231 avatar Feb 04 '25 06:02 bao231

@simon-mo

bao231 avatar Feb 04 '25 08:02 bao231

Hi @bao231, V1 does not support T4 or older-generation GPUs since the kernel libraries used in V1 (e.g., flash-attn) do not support them.
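
For anyone on T4s in the meantime, a sketch of pinning the engine to V0 explicitly (model name and port are illustrative), mirroring the subprocess-style launches used elsewhere in this thread:

import os
import subprocess

env = os.environ.copy()
env["VLLM_USE_V1"] = "0"  # stay on the V0 engine, which still supports pre-Ampere GPUs

subprocess.run(
    ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--port", "8000"],
    env=env,
    check=True,
)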

WoosukKwon avatar Feb 04 '25 09:02 WoosukKwon

Will V1 support other attention libraries? Do you have a plan for that? @WoosukKwon

bao231 avatar Feb 04 '25 10:02 bao231

I did a little experiment with DeepSeek-R1 on 8xH200 GPU. [...] In general, vLLM without VLLM_USE_V1 looked more productive. [...] But running vLLM with VLLM_USE_V1=1 I got an error, TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'q_lora_rank', preceded by warnings like `torch.compile` is turned on, but the model deepseek-ai/DeepSeek-R1 does not support it.

Thanks!

  • We are aware of the performance gap for DeepSeekV3 and are actively working on it. See https://github.com/vllm-project/vllm/pull/12676, which will resolve the gap. We will hopefully do a release today with this change.
  • DeepSeekV3 is not yet supported on V1 since it requires chunked prefill. We are actively working on chunked prefill for MLA and hope to have it complete this week!

robertgshaw2-redhat avatar Feb 04 '25 14:02 robertgshaw2-redhat

Either I'm going insane, or with V1 the Qwen 8B instruct LLM just breaks in fp8: around 25% of generations are pure gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and do I need some specific setup of sampling params for it to work in V1?

Can you provide more detailed reproduction instructions?

cc @WoosukKwon

robertgshaw2-redhat avatar Feb 04 '25 14:02 robertgshaw2-redhat

πŸ‘ I have not done a proper benchmark but V1 feels superior, i.e. higher throughput + lower latency, TTFT. The other thing that I have noticed is that logging has changed Running: 1 reqs, Waiting: 0 reqs, it used to print stats such token/s. I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1. #12529

Thanks for fixing metrics logs in 0.7.1! Lack of pipeline parallelism in V1 is a show stopper for production deployments #11945

Thanks. We are actively working on PP

robertgshaw2-redhat avatar Feb 04 '25 14:02 robertgshaw2-redhat

Maybe there is a better place to discuss this, but the implementation for models that use more than one extra modality is quite non-intuitive. get_multimodal_embeddings() expects us to return a list or tensor whose length equals the number of multimodal items provided in the batch, and we then have to make unintuitive assumptions about what the output passed into get_input_embeddings will look like, because the batching used when calling the two functions is not the same. It would be much nicer if, for example, the input and output of get_multimodal_embeddings were dicts keyed by the different modalities.

Check out #sig-multi-modality in our Slack! This is the best place for a discussion like this.

robertgshaw2-redhat avatar Feb 04 '25 14:02 robertgshaw2-redhat

Something is weird with memory calculation in V1 and tensor parallel. Here are 2 cases that I tested recently:

vllm 0.7.0 on 2x A6000:

Starting a 32b-awq model normally and using --max-model-len 32768 --gpu-memory-utilization 0.98 --tensor-parallel 2 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768:

Everything works as previously; both GPUs get to ~44-46GB usage.

Using VLLM_USE_V1=1 and the exact same parameters as above:

Both GPUs load up to ~24-25GB, and usage slowly goes up as inference runs. I've seen it go up to 32GB on each GPU.

Updating to vllm 0.7.1 and running a 7b-awq model this time, I also noticed that when running the above command "normally", the logs show Maximum concurrency at 44x.

Using V1 I get:

INFO 02-02 23:26:19 kv_cache_utils.py:400] Maximum concurrency for 32768 tokens per request: **22.25x**

And finally, with vllm 0.7.0 and 4x L4, loading a 32b-awq model with tp 4 works in "normal mode" but OOMs with V1.

It's pretty hard to follow what you are seeing. Please attach:

  • launch command
  • logs

Thanks!

robertgshaw2-redhat avatar Feb 04 '25 14:02 robertgshaw2-redhat

Its pretty hard to follow what you are seeing. Please attach:

* launch command

* logs

Hi, please see vllm_output(27)-OOM.log for OOM on 4x L4 and vllm_output(28)-WORKS.log to compare. The only difference between them is the V1 flag.

Launch command

import os
import subprocess

# log_file is assumed to be an already-opened file object used to capture the server output
my_env = os.environ.copy()
my_env["VLLM_USE_V1"] = "0"

# launch the server as a background task
command = [
    "python",
    "-m",
    "vllm.scripts",
    "serve",
    "/kaggle/input/qwen25/transformers/r1-32b-awq/1",
    "--served-model-name", "model",
    "--tensor_parallel_size", "4",
    "--gpu_memory_utilization", "0.95",
    "--port", "9901",
    "--max-num-batched-tokens", "32768",
    "--max-seq-len-to-capture", "32768",
    "--max-model-len", "32768",
    "--enable_prefix_caching",
]

process = subprocess.Popen(command, stdout=log_file, stderr=log_file, env=my_env)

vllm_output(28)-WORKS.log vllm_output(27)-OOM.log

gmonair avatar Feb 04 '25 15:02 gmonair

I ran the following code after upgrading to the V1 version of vLLM and encountered an error:

import subprocess
import os

my_env = os.environ.copy()
my_env["VLLM_USE_V1"] = "1"
command = [
    "python", "-m", "vllm.scripts", "serve",
    "./pretrained/intervl2-8B",
    "--served-model-name", "intervl2-8B",
    "--tensor_parallel_size", "2",
    "--limit-mm-per-prompt", "image=10",
    "--pipeline-parallel-size", "1",
    "--gpu_memory_utilization", "0.9",
    "--port", "40004",
    "--max-num-batched-tokens", "10000",
    "--max-seq-len-to-capture", "10000",
    "--max-model-len", "10000",
    "--enable_prefix_caching",
    "--trust_remote_code",
]
process = subprocess.Popen(command, env=my_env)

(screenshot of the error attached)

However, if --tensor_parallel_size is set to 1, it works fine. Is there a compatibility issue between the V1 engine and multi-GPU (tensor-parallel) deployment?

caoyang-lqp avatar Feb 05 '25 08:02 caoyang-lqp

With dual rtx3090 in V1: VLLM_USE_V1=1 REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt CUDA_DEVICE_ORDER=PCI_BUS_ID OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0,1 vllm serve kosbu/QVQ-72B-Preview-AWQ --tensor-parallel-size 2 --gpu-memory-utilization 0.99 --api-key aaaaa --max-model-len 7000 --quantization=awq_marlin --enforce-eager

CUDA out of memory. Tried to allocate 594.00 MiB. GPU 0 has a total capacity of 23.48 GiB of which 587.38 MiB is free. Including non-PyTorch memory, this process has 22.89 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 815.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation

With V0 it works; something changed about memory in V1.
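
For what it's worth, a sketch of what I would try next before filing a separate issue: the allocator hint from the error message plus slightly more headroom. The values are illustrative, not a confirmed fix.

import os
import subprocess

env = os.environ.copy()
env["VLLM_USE_V1"] = "1"
env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # suggested in the OOM message

subprocess.run(
    [
        "vllm", "serve", "kosbu/QVQ-72B-Preview-AWQ",
        "--tensor-parallel-size", "2",
        "--gpu-memory-utilization", "0.95",   # a bit below 0.99 to leave headroom for V1
        "--max-model-len", "7000",
        "--quantization", "awq_marlin",
        "--enforce-eager",
    ],
    env=env,
    check=True,
)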

rstanislav avatar Feb 05 '25 21:02 rstanislav

Will V1 support flashinfer in the future?

JaheimLee avatar Feb 06 '25 11:02 JaheimLee

Does V1 support FP8 (W8A8) quantization?

I tried nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic on the v0.7.1 V1 arch: no error was thrown, but I got gibberish results. The same code and model work properly on the v0.7.1 V0 arch.


UPDATE: it works on the v0.7.1 V1 arch in eager mode, but is broken on the v0.7.1 V1 arch in torch.compile mode. I'm figuring out whether this problem is model-dependent or not.

UPDATE: tried another model, nm-testing/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic, and the same bug presents on the v0.7.1 V1 arch in torch.compile mode.


UPDATE: it works after I turned custom_ops on (changed "none" to "all")

https://github.com/vllm-project/vllm/blob/3ee696a63dd0c2acee44809a3bedec33ea27dfa0/vllm/config.py#L3237-L3249
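
A sketch of the same workaround expressed as an engine-level override instead of editing vllm/config.py; I am assuming the compilation_config override accepted in recent versions, so treat the exact parameter name and shape as an assumption:

from vllm import LLM

# custom_ops=["all"] mirrors the "none" -> "all" change described above (assumed API)
llm = LLM(
    model="nm-testing/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic",
    compilation_config={"custom_ops": ["all"]},
)
print(llm.generate(["Hello"])[0].outputs[0].text)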

imkero avatar Feb 06 '25 12:02 imkero

When I tested the fine-tuned Qwen2.5_VL_3B model served in OpenAI-compatible mode, comparing V1 mode (by setting the environment variable VLLM_USE_V1=1) against the default mode, I found inconsistencies in the output results.

I tested two samples:

  • First sample: In V1 mode, the output was less than half of the expected result, while the default mode produced the complete output.
  • Second sample: In V1 mode, the output was mostly complete but contained many obvious errors, whereas the default mode was correct and complete.

I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.

Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between v1 mode and Qwen2.5_VL_3B?

lyhh123 avatar Feb 08 '25 08:02 lyhh123