
Add Serving Benchmark Script

Open · tiannuo-yang opened this pull request 5 months ago · 2 comments

This PR introduces a new benchmark script, serving_bench.py, to evaluate the engine's performance under a continuous load of incoming requests, simulating a real-world serving scenario.

Note: This PR is purely additive. No core files have been modified.

Key Features of serving_bench.py

  • Simulates Online Serving: Models the incoming request stream as a Poisson arrival process (see the sketch after this list).
  • Comprehensive Metrics: Measures throughput, Time To First Token (TTFT), Time Per Output Token (TPOT), and end-to-end latency.
  • Live Progress: Uses tqdm to display real-time progress and average latency.
  • Configurable: Allows setting the request rate and total number of requests via command-line arguments.
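For readers unfamiliar with the arrival model: a Poisson request stream is usually simulated by drawing exponentially distributed inter-arrival gaps. The minimal sketch below illustrates the idea only; the function and parameter names are illustrative assumptions, not necessarily what serving_bench.py actually uses.

import asyncio
import random

async def generate_requests(prompts, request_rate):
    # Yield prompts with exponentially distributed gaps between them,
    # which makes the overall arrival process Poisson with rate `request_rate`.
    for prompt in prompts:
        yield prompt
        if request_rate == float("inf"):
            continue  # no pacing: issue all requests back to back
        await asyncio.sleep(random.expovariate(request_rate))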

Benchmark Results

The following results demonstrate the engine's performance under different request rates (1× NVIDIA L20 48 GB GPU, Qwen3-0.6B).

Request Rate (req/s) | Throughput (tok/s) | Avg TTFT (ms) | Avg TPOT (ms/tok) | Avg Latency (s)
-------------------- | ------------------ | ------------- | ----------------- | ---------------
4                    | 2046.27            | 87.56         | 5.74              | 3.06
8                    | 3636.29            | 102.46        | 10.99             | 5.85
16                   | 4205.13            | 142.40        | 18.07             | 9.56
32                   | 4631.52            | 353.51        | 27.45             | 14.20

The results show that throughput rises with the request rate, though the gains taper off as the engine approaches saturation, which validates the dynamic batching mechanism. As expected, the higher throughput comes at the cost of increased TTFT and end-to-end latency.
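For context, the metrics in the table are conventionally derived from three per-request timestamps: arrival, first output token, and completion. The sketch below shows one way to aggregate them; the field and function names are assumptions for illustration and are not taken from serving_bench.py.

from dataclasses import dataclass

@dataclass
class RequestTiming:
    arrival: float          # when the request was submitted
    first_token: float      # when the first output token arrived
    finish: float           # when the last output token arrived
    num_output_tokens: int

def summarize(timings):
    # TTFT covers queueing + prefill; TPOT is the average decode time per generated token.
    ttft = [t.first_token - t.arrival for t in timings]
    tpot = [(t.finish - t.first_token) / max(t.num_output_tokens - 1, 1) for t in timings]
    latency = [t.finish - t.arrival for t in timings]
    total_tokens = sum(t.num_output_tokens for t in timings)
    wall_time = max(t.finish for t in timings) - min(t.arrival for t in timings)
    return {
        "throughput_tok_s": total_tokens / wall_time,
        "avg_ttft_ms": 1000 * sum(ttft) / len(ttft),
        "avg_tpot_ms": 1000 * sum(tpot) / len(tpot),
        "avg_latency_s": sum(latency) / len(latency),
    }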

How to Use

# Run the benchmark with a specific request rate
python serving_bench.py --request-rate 16 --num-requests 256
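The two flags above are the ones listed under "Configurable". A plausible argparse setup is sketched below; the defaults shown are illustrative assumptions, not the script's actual values.

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Online serving benchmark")
    parser.add_argument("--request-rate", type=float, default=float("inf"),
                        help="average arrival rate in req/s (inf sends all requests at once)")
    parser.add_argument("--num-requests", type=int, default=256,
                        help="total number of requests to send")
    return parser.parse_args()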

tiannuo-yang · Jun 21 '25