Add output streaming support to multi-step + async
This PR adds output streaming support to multi-step + async. A first, naive implementation of streaming with multi-step caused a significant performance degradation (almost 2x slower TPOT). After some investigation, we found that the key bottleneck was the repeated generation of full request outputs. To solve this, the PR introduces incremental/delta generation of request outputs for each sequence group, so that on each decode iteration only the changes for a sequence are sent, rather than the whole `RequestOutput`.
The implementation sits above MQLLMEngine and depends on https://github.com/vllm-project/vllm/pull/8157 landing first.
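For illustration, here is a minimal sketch of the delta idea; the class and method names below are hypothetical and not the PR's actual API. Each sequence tracks how much output has already been streamed, and on every decode step only the increment is packaged and sent:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeltaOutput:
    """Only the tokens/text produced since the previous emission (hypothetical)."""
    request_id: str
    new_token_ids: List[int]
    new_text: str
    finished: bool = False


@dataclass
class DeltaTracker:
    """Remembers how much of a sequence has already been streamed,
    so each decode iteration ships only the increment."""
    sent_token_count: int = 0
    sent_text_len: int = 0

    def make_delta(self, request_id: str, token_ids: List[int],
                   text: str, finished: bool) -> DeltaOutput:
        delta = DeltaOutput(
            request_id=request_id,
            new_token_ids=token_ids[self.sent_token_count:],
            new_text=text[self.sent_text_len:],
            finished=finished,
        )
        # Advance the watermark so the next call only sends newer output.
        self.sent_token_count = len(token_ids)
        self.sent_text_len = len(text)
        return delta
```

The consumer side then appends each delta to its accumulated output instead of re-processing the full request output on every step, which is what avoids the repeated-generation cost observed in the naive implementation.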
Performance results for Llama 3.1 8B on an H100 GPU with ShareGPT show that streaming adds only a ~7% penalty to mean TPOT (26.01 ms → 27.78 ms) while improving mean ITL by ~7.8x (166.52 ms → 21.30 ms).
Multi-step + async + no streaming
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 23.56
Total input tokens: 215196
Total generated tokens: 197521
Request throughput (req/s): 42.44
Output token throughput (tok/s): 8382.12
Total Token throughput (tok/s): 17514.31
---------------Time to First Token----------------
Mean TTFT (ms): 7615.81
Median TTFT (ms): 7257.60
P99 TTFT (ms): 16401.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 26.01
Median TPOT (ms): 22.20
P99 TPOT (ms): 126.35
---------------Inter-token Latency----------------
Mean ITL (ms): 166.52
Median ITL (ms): 169.27
P99 ITL (ms): 505.28
==================================================
Multi-step + async + with streaming
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 23.81
Total input tokens: 215196
Total generated tokens: 197520
Request throughput (req/s): 41.99
Output token throughput (tok/s): 8294.15
Total Token throughput (tok/s): 17330.55
---------------Time to First Token----------------
Mean TTFT (ms): 7500.72
Median TTFT (ms): 7002.99
P99 TTFT (ms): 15986.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 27.78
Median TPOT (ms): 22.95
P99 TPOT (ms): 145.44
---------------Inter-token Latency----------------
Mean ITL (ms): 21.30
Median ITL (ms): 14.17
P99 ITL (ms): 193.40
==================================================