
Add output streaming support to multi-step + async

alexm-redhat opened this issue 5 months ago • 2 comments

This PR adds output streaming support to multi-step + async. A first naive implementation of streaming with multi-step resulted in a significant performance degradation (almost 2x slower TPOT). After some investigation, we found that the key bottleneck was the repeated generation of full request outputs. To solve this, this PR introduces incremental/delta generation of request outputs for each sequence group, so that on each decode iteration only the changes are sent (rather than the whole request_output).
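The delta idea can be sketched as follows. This is an illustrative Python sketch, not vLLM's actual implementation; the class names (`DeltaOutput`, `SequenceState`) and the method `make_delta` are hypothetical, chosen only to show why emitting deltas is O(new tokens) per iteration instead of O(all tokens generated so far):

```python
# Hypothetical sketch of delta-style request outputs: instead of re-sending
# the full accumulated output on every decode step, only the newly generated
# tokens/text for each sequence are emitted.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeltaOutput:
    """Only what was produced since the previous iteration."""
    request_id: str
    new_token_ids: List[int]
    new_text: str
    finished: bool = False


@dataclass
class SequenceState:
    """Per-request state kept by the output processor."""
    token_ids: List[int] = field(default_factory=list)
    text: str = ""

    def make_delta(self, request_id: str, all_token_ids: List[int],
                   all_text: str, finished: bool) -> DeltaOutput:
        # Slice off only the suffix that is new since the last call,
        # then remember the full state for the next iteration.
        new_ids = all_token_ids[len(self.token_ids):]
        new_text = all_text[len(self.text):]
        self.token_ids = list(all_token_ids)
        self.text = all_text
        return DeltaOutput(request_id, new_ids, new_text, finished)


# Two consecutive decode iterations for one request:
state = SequenceState()
first = state.make_delta("req-0", [11, 12], "he", finished=False)
second = state.make_delta("req-0", [11, 12, 13], "hel", finished=False)
# `second` carries only token 13 and the text "l", not the whole output.
```

The client reconstructs the full output by concatenating the deltas it receives, so the per-step cost on both sides stays proportional to the number of new tokens.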

The implementation is done above MQLLMEngine and it depends on https://github.com/vllm-project/vllm/pull/8157 landing first.

Performance results for Llama 3.1 8B on an H100 GPU with ShareGPT show that streaming introduces only a 6% TPOT penalty while improving ITL by 7.8x.
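For reference, ITL in the tables below is the gap between consecutive token arrivals at the client (the first token is covered by TTFT). A minimal sketch of how such a metric could be computed from per-token arrival timestamps follows; this is an assumption-level illustration, not the actual benchmark script:

```python
from typing import List


def mean_itl_ms(arrival_times_s: List[float]) -> float:
    """Mean inter-token latency in ms from token arrival timestamps (seconds).

    Takes the gaps between consecutive arrivals; the first token's latency
    is excluded, since that is reported separately as TTFT.
    """
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    return 1000.0 * sum(gaps) / len(gaps)
```

Without streaming, multi-step delivers tokens to the client in bursts (one burst per scheduler step), so the arrival gaps are large and uneven; with streaming, tokens arrive per decode iteration, which is why mean ITL drops from ~166 ms to ~21 ms below while TPOT is nearly unchanged.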

Multi-step + async + no streaming

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  23.56     
Total input tokens:                      215196    
Total generated tokens:                  197521    
Request throughput (req/s):              42.44     
Output token throughput (tok/s):         8382.12   
Total Token throughput (tok/s):          17514.31  
---------------Time to First Token----------------
Mean TTFT (ms):                          7615.81   
Median TTFT (ms):                        7257.60   
P99 TTFT (ms):                           16401.42  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.01     
Median TPOT (ms):                        22.20     
P99 TPOT (ms):                           126.35    
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.52    
Median ITL (ms):                         169.27    
P99 ITL (ms):                            505.28    
==================================================

Multi-step + async + with streaming

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  23.81     
Total input tokens:                      215196    
Total generated tokens:                  197520    
Request throughput (req/s):              41.99     
Output token throughput (tok/s):         8294.15   
Total Token throughput (tok/s):          17330.55  
---------------Time to First Token----------------
Mean TTFT (ms):                          7500.72   
Median TTFT (ms):                        7002.99   
P99 TTFT (ms):                           15986.18  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.78     
Median TPOT (ms):                        22.95     
P99 TPOT (ms):                           145.44    
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.30     
Median ITL (ms):                         14.17     
P99 ITL (ms):                            193.40    
==================================================

alexm-redhat — Sep 10 '24 13:09