nano-vllm
Add Serving Benchmark Script
This PR introduces a new benchmark script, serving_bench.py, to evaluate the engine's performance under a continuous load of incoming requests, simulating a real-world serving scenario.
Note: This PR is purely additive. No core files have been modified.
Key Features of serving_bench.py
- Simulates Online Serving: Models a request stream using a Poisson distribution.
- Comprehensive Metrics: Measures throughput, Time To First Token (TTFT), Time Per Output Token (TPOT), and end-to-end latency.
- Live Progress: Uses `tqdm` to display real-time progress and average latency.
- Configurable: Allows setting the request rate and total number of requests via command-line arguments.
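As a rough illustration of the request-stream simulation, a Poisson arrival process can be modeled by drawing inter-arrival gaps from an exponential distribution (the gaps of a Poisson process with rate λ are exponential with mean 1/λ). This is a minimal sketch, not the script's actual implementation; `poisson_intervals` is a hypothetical helper name.

```python
import random


def poisson_intervals(request_rate: float, num_requests: int, seed: int = 0):
    """Yield inter-arrival gaps (seconds) for a Poisson request stream.

    Inter-arrival times of a Poisson process with rate `request_rate`
    are exponentially distributed with mean 1 / request_rate.
    (Hypothetical helper for illustration; serving_bench.py may differ.)
    """
    rng = random.Random(seed)
    for _ in range(num_requests):
        yield rng.expovariate(request_rate)


# At 16 req/s the mean gap between requests should be close to 1/16 s.
gaps = list(poisson_intervals(request_rate=16.0, num_requests=1000))
print(sum(gaps) / len(gaps))
```

The benchmark loop would then sleep for each gap before dispatching the next request, so the engine sees bursty, realistic arrivals rather than a fixed-interval stream.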
Benchmark Results
The following results demonstrate the system's performance under different request rates (1× L20 48GB GPU, Qwen3-0.6B).
| Request Rate (req/s) | Throughput (tok/s) | Avg TTFT (ms) | Avg TPOT (ms/tok) | Avg Latency (s) |
|---|---|---|---|---|
| 4 | 2046.27 | 87.56 | 5.74 | 3.06 |
| 8 | 3636.29 | 102.46 | 10.99 | 5.85 |
| 16 | 4205.13 | 142.40 | 18.07 | 9.56 |
| 32 | 4631.52 | 353.51 | 27.45 | 14.20 |
The results show that throughput scales effectively with the request rate, which validates the dynamic batching mechanism. As expected, higher throughput is achieved at the cost of increased latency.
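For reference, the per-request metrics above can be derived from the request's send time and the timestamps at which each output token completed. The sketch below shows the standard definitions (TPOT is measured over decode tokens only, excluding the first token); `compute_metrics` is a hypothetical helper, not necessarily how serving_bench.py structures its code.

```python
def compute_metrics(t_sent: float, token_times: list[float]):
    """Compute TTFT, TPOT, and end-to-end latency for one request.

    t_sent: wall-clock time the request was dispatched.
    token_times: wall-clock completion time of each generated token.
    (Hypothetical helper for illustration.)
    """
    ttft = token_times[0] - t_sent          # Time To First Token
    e2e = token_times[-1] - t_sent          # end-to-end latency
    # TPOT covers decode only: time per token after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot, e2e


# Request sent at t=0, four tokens arriving at 0.10, 0.11, 0.12, 0.13 s.
ttft, tpot, e2e = compute_metrics(0.0, [0.10, 0.11, 0.12, 0.13])
```

Averaging these per-request values over the run yields the table's Avg TTFT, Avg TPOT, and Avg Latency columns.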
How to Use
```shell
# Run the benchmark with a specific request rate
python serving_bench.py --request-rate 16 --num-requests 256
```