
[Feature] Consolidate performance benchmark datasets

JenZhao opened this issue 9 months ago

Addressing #13351
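For context, the consolidation can be pictured as a single dataset abstraction that both the serving and throughput benchmark scripts sample from, instead of each script carrying its own loaders. The sketch below is illustrative only: the class and method names (BenchmarkDataset, SampleRequest, sample) are assumptions made for this example, not necessarily the identifiers introduced by the PR.

import json
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRequest:
    # One benchmark request: the prompt text plus token accounting.
    prompt: str
    prompt_len: int
    expected_output_len: int
    multi_modal_data: Optional[dict] = None


class BenchmarkDataset:
    # Shared base class: every dataset exposes the same sample() signature,
    # so benchmark_serving.py and benchmark_throughput.py can stay agnostic
    # about where the prompts come from.
    def __init__(self, dataset_path: Optional[str] = None, seed: int = 0):
        self.dataset_path = dataset_path
        self.rng = random.Random(seed)  # fixed seed -> reproducible sampling

    def sample(self, tokenizer, num_requests: int) -> list[SampleRequest]:
        raise NotImplementedError


class ShareGPTDataset(BenchmarkDataset):
    # Example concrete dataset: filter ShareGPT conversations and turn the
    # first human/assistant turn into a (prompt, output length) pair.
    def sample(self, tokenizer, num_requests: int) -> list[SampleRequest]:
        with open(self.dataset_path) as f:
            data = json.load(f)
        convs = [d["conversations"] for d in data
                 if len(d.get("conversations", [])) >= 2]
        self.rng.shuffle(convs)
        requests = []
        for conv in convs:
            if len(requests) == num_requests:
                break
            prompt, completion = conv[0]["value"], conv[1]["value"]
            prompt_len = len(tokenizer(prompt).input_ids)
            output_len = len(tokenizer(completion).input_ids)
            requests.append(SampleRequest(prompt, prompt_len, output_len))
        return requests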

Benchmark Serving Results

after the change

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 100 4.15 54541
hf-vision-arena openai-chat 100 11.06 16589
hf openai-chat 100 9.73 1166
sonnet vllm 100 3.76 54541
sharegpt vllm 100 8.48 23260
random vllm 100 4.96 102400
burstgpt vllm 100 21.84 77561

before the change

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 100 4.01 54541
hf-vision-arena openai-chat 100 11.21 16589
hf openai-chat 100 9.83 1166
sonnet vllm 100 3.75 54541
sharegpt vllm 100 8.42 23260
random vllm 100 4.84 102400
burstgpt vllm 100 21.66 77561
MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
  "python3 benchmarks/benchmark_serving.py --backend openai-chat --model ${MODEL_NAME} --endpoint /v1/chat/completions --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el"
  "python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmms-lab/LLaVA-OneVision-Data --hf-split train --hf-subset \"chart2text(cauldron)\" --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sharegpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name random --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name burstgpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --num-prompts ${NUM_PROMPTS}"

Benchmark Throughput Results

after the change

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 50.44 1513.07 1008.71
ShareGPT 10 1.66 605.33 378.11
sonnet 10 7.62 4960.96 1142.38
burstgpt 10 2.17 2999.05 406.72

before the change (sonnet and burstgpt are not supported)

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 51.13 1534.02 1022.68
ShareGPT 10 1.66 604.19 377.39
sonnet 10 (not supported)
burstgpt 10 (not supported)
MODEL="NousResearch/Hermes-3-Llama-3.1-8B"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --input_len 10 --output_len 20 --dataset-name random --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/vllm/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset-name sonnet --dataset benchmarks/sonnet.txt --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --dataset-name burstgpt --num-prompts $NUM_PROMPTS"

Benchmark Throughput Results - Image Support

Command copied from #9851. Since the COCO dataset is too large, I used random images here.

import numpy as np
from PIL import Image

# Replace each COCO image with a random 256x256 RGB image.
random_array = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mm_data["image"] = Image.fromarray(random_array)

after the change (1000 requests): Throughput: 15.47 requests/s, 3304.12 total tokens/s, 3048.79 output tokens/s
before the change (1000 requests): Throughput: 14.90 requests/s, 3183.03 total tokens/s, 2937.06 output tokens/s

python benchmarks/benchmark_throughput.py \
    --model mistral-community/pixtral-12b \
    --max-model-len=8192 \
    --dataset sharegpt4v_instruct_gpt4-vision_cap100k.json

LoRA request test

Commands are copied from PR #11267 (a minimal sketch of the corresponding vLLM LoRA API follows the command list below).

after the change

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT 1000 1 8 Yes No 11.66 5610.75 2742.90
ShareGPT 1000 4 8 Yes No 11.59 5575.73 2725.78
ShareGPT 1000 N/A N/A No Yes 17.42 8383.51 4098.41
ShareGPT 1000 1 8 Yes Yes 11.50 5535.98 2706.35
ShareGPT 1000 4 8 Yes Yes 11.25 5412.76 2646.11

before the change

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT 1000 1 8 Yes No 10.84 5216.17 2550.01
ShareGPT 1000 4 8 Yes No 10.80 5197.68 2540.97
ShareGPT 1000 N/A N/A No Yes 16.75 8061.23 3940.86
ShareGPT 1000 1 8 Yes Yes 11.08 5332.47 2606.86
ShareGPT 1000 4 8 Yes Yes 10.84 5215.25 2549.56
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine"
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""

RandomDataSet throughput test

after the change: this now uses the random sampling defined in benchmark_serving.py (a rough sketch of that sampling scheme follows the command below)

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s Range Ratio Prefix Len Input Len Output Len
random 10 51.92 1277.24 752.85 0.5 2 10 20
random 10 39.35 1188.51 830.39 0.5 2 10 30
random 10 51.20 1689.73 834.62 0.5 2 20 20
random 10 36.60 1463.81 786.80 0.5 2 20 30
random 10 51.89 1660.44 1037.78 1.0 2 10 20

before the change: this is the original random sampling defined in benchmark_throughput.py. Range ratio and prefix len are not used by the original throughput test's random sampling.

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s Range Ratio Prefix Len Input Len Output Len
random 10 53.07 1592.00 1061.33 N/A N/A 10 20
random 10 37.15 1486.08 1114.56 N/A N/A 10 30
random 10 51.30 2052.04 1026.02 N/A N/A 20 20
random 10 37.23 1861.38 1116.83 N/A N/A 20 30
# parameters are defined in the table above
VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model NousResearch/Hermes-3-Llama-3.1-8B --dataset-name random --num-prompts 10 --prefix-len 2 --random-range-ratio 1.0 --input-len 10 --output-len 20
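For reference, here is a rough sketch of what serving-style random sampling with a shared prefix and a range ratio looks like. The helper name and exact length distribution are assumptions; the real logic lives in benchmark_serving.py and may differ in the details.

import numpy as np


def sample_random_requests(tokenizer, num_prompts, input_len, output_len,
                           range_ratio, prefix_len, seed=0):
    rng = np.random.default_rng(seed)
    vocab_size = tokenizer.vocab_size

    # A prefix of prefix_len random tokens is shared by every request.
    prefix_ids = rng.integers(0, vocab_size, size=prefix_len).tolist()

    # Per-request lengths are drawn uniformly from [range_ratio * len, len];
    # range_ratio == 1.0 therefore means fixed lengths.
    input_lens = rng.integers(int(input_len * range_ratio), input_len + 1,
                              size=num_prompts)
    output_lens = rng.integers(int(output_len * range_ratio), output_len + 1,
                               size=num_prompts)

    requests = []
    for in_len, out_len in zip(input_lens, output_lens):
        body_ids = rng.integers(0, vocab_size, size=int(in_len)).tolist()
        prompt = tokenizer.decode(prefix_ids + body_ids)
        requests.append((prompt, prefix_len + int(in_len), int(out_len)))
    return requests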

Scripts for generating the tables above are here.

JenZhao avatar Feb 28 '25 10:02 JenZhao

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 28 '25 10:02 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @JenZhao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 03 '25 01:03 mergify[bot]

Will fix LoRA and re-run the tests later.

JenZhao avatar Mar 08 '25 07:03 JenZhao

Latest testing: checking why the ShareGPT result looks different now.

benchmark_serving.py main branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.54 5409
hf-vision-arena openai-chat 10 2.60 7191
hf openai-chat 10 1.52 115
sonnet vllm 10 1.53 5409
sharegpt vllm 10 6.76 1374
random vllm 10 1.49 10240
burstgpt vllm 10 5.93 11970

benchmark_serving.py latest change

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.49 5409
hf-vision-arena openai-chat 10 2.26 7191
hf openai-chat 10 1.26 115
sonnet vllm 10 1.50 5409
sharegpt vllm 10 1.14 1960
random vllm 10 1.45 10240
burstgpt vllm 10 5.84 11970

benchmark_throughput.py main branch

The main branch only supports the random and ShareGPT datasets, and its random dataset generation differs from this branch's.

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.83 7867.60 874.18
ShareGPT_V3_unfiltered_cleaned_split.json 10 1.60 725.71 345.22
sonnet 10 (not supported)
burstgpt 10 (not supported)

benchmark_throughput.py latest change

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.33 7297.28 810.81
ShareGPT_V3_unfiltered_cleaned_split.json 10 3.25 1639.56 1152.57
sonnet 10 6.91 4557.63 1035.98
burstgpt 10 2.06 2853.16 386.93

JenZhao avatar Mar 09 '25 03:03 JenZhao

testing again

Throughput Benchmark Results, this branch

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.80 7834.99 870.55
ShareGPT_V3_unfiltered_cleaned_split.json 10 3.44 883.61 434.41
sonnet 10 6.85 4496.04 1027.27
burstgpt 10 2.10 2906.03 394.10

Throughput Benchmark Results, main branch

  • The main branch does not support Sonnet and BurstGPT.
  • The main branch’s random dataset definition is different from that of this branch; this branch uses the random dataset definition in the serving script.
  • This branch also uses the serving script's ShareGPT sampling, so there are some very minor differences.
Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.83 7867.60 874.18
ShareGPT_V3_unfiltered_cleaned_split.json 10 1.60 725.71 345.22
sonnet 10 (not supported)
burstgpt 10 (not supported)

Serving Benchmark Results, this branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.53 5409
hf-vision-arena openai-chat 10 2.61 7191
hf openai-chat 10 1.53 115
sonnet vllm 10 1.54 5409
sharegpt vllm 10 6.53 1374
random vllm 10 1.48 10240
burstgpt vllm 10 5.92 11970

Serving Benchmark Results, main branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.54 5409
hf-vision-arena openai-chat 10 2.60 7191
hf openai-chat 10 1.52 115
sonnet vllm 10 1.53 5409
sharegpt vllm 10 6.76 1374
random vllm 10 1.49 10240
burstgpt vllm 10 5.93 11970

Lora Benchmark Results, this branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 10 1 8 Yes No 1.72 705.30 327.45
ShareGPT_V... 10 4 8 Yes No 1.34 663.50 230.17
ShareGPT_V... 10 N/A N/A No Yes 1.94 1034.53 365.42
ShareGPT_V... 10 1 8 Yes Yes 2.40 1014.93 379.04
ShareGPT_V... 10 4 8 Yes Yes 0.99 633.56 359.78

Lora Benchmark Results, main branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 10 1 8 Yes No 1.65 819.24 467.69
ShareGPT_V... 10 4 8 Yes No 1.41 736.43 240.00
ShareGPT_V... 10 N/A N/A No Yes 1.77 773.02 375.51
ShareGPT_V... 10 1 8 Yes Yes 1.85 736.69 443.24
ShareGPT_V... 10 4 8 Yes Yes 1.76 729.75 240.61

JenZhao avatar Mar 09 '25 05:03 JenZhao

Testing with 1000 requests.

Lora Benchmark Results, this branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 1000 1 8 Yes No 12.05 5755.06 2816.05
ShareGPT_V... 1000 4 8 Yes No 10.85 5351.16 2539.97
ShareGPT_V... 1000 N/A N/A No Yes 16.52 8256.40 3949.34
ShareGPT_V... 1000 1 8 Yes Yes 11.93 5582.34 2611.23
ShareGPT_V... 1000 4 8 Yes Yes 11.21 5364.53 2653.42

Lora Benchmark Results, main branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 1000 1 8 Yes No 11.42 5434.65 2654.52
ShareGPT_V... 1000 4 8 Yes No 10.87 5368.56 2520.88
ShareGPT_V... 1000 N/A N/A No Yes 15.34 7528.82 3769.25
ShareGPT_V... 1000 1 8 Yes Yes 11.60 5719.71 2573.52
ShareGPT_V... 1000 4 8 Yes Yes 10.96 5139.18 2462.06

JenZhao avatar Mar 09 '25 05:03 JenZhao

Testing with 1000 requests.

Throughput Results, this branch

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 1000 25.18 29006.31 3222.92
ShareGPT_V3_unfiltered_cleaned_split.json 1000 39.72 16360.76 7468.34
sonnet 1000 50.81 33423.33 7622.20
burstgpt 1000 14.06 15629.87 4817.21

Throughput Results, main branch

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 1000 26.14 30113.14 3345.90
ShareGPT_V3_unfiltered_cleaned_split.json 1000 37.88 15570.40 7188.78
sonnet 1000 (not supported)
burstgpt 1000 (not supported)

JenZhao avatar Mar 09 '25 06:03 JenZhao

Serving Results, this branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 1000 30.81 546875
hf-vision-arena openai-chat 500 63.45 33418
hf openai-chat 1000 90.91 11428
sonnet vllm 1000 30.43 546875
sharegpt vllm 1000 34.47 217393
random vllm 1000 42.87 1024000
burstgpt vllm 1000 101.00 768960

Serving Results, main branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 1000 32.32 546875
hf-vision-arena openai-chat 500 64.04 33418
hf openai-chat 1000 478.90 11428
sonnet vllm 1000 37.38 546875
sharegpt vllm 1000 37.34 217393
random vllm 1000 59.84 1024000
burstgpt vllm 1000 109.34 768960

JenZhao avatar Mar 09 '25 07:03 JenZhao

Throughput Results, this branch

ShareGPT does not match; will look into this later.

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
random 10 10240 11520 1280 6.74 7765.54 862.84
ShareGPT_V3_unfiltered_cleaned_split.json 10 1798 3710 1912 2.75 1021.74 526.57
sonnet 10 5089 6589 1500 6.93 4563.60 1038.91
burstgpt 10 11970 13848 1878 2.05 2839.61 385.09

Throughput Results, main branch

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
random 10 10240 11520 1280 6.85 7896.75 877.42
ShareGPT_V3_unfiltered_cleaned_split.json 10 2474 3751 1277 2.89 1085.52 369.56
sonnet 10 (not supported)
burstgpt 10 (not supported)

JenZhao avatar Mar 09 '25 08:03 JenZhao


We should try to find out why sampling for ShareGPT is different between main and this branch, since this is actually quite important. Also can you check for 1000 requests?

ywang96 avatar Mar 09 '25 08:03 ywang96


OK, they now match after setting the same random seed. A short sketch of the seeding follows the tables below.

main

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
ShareGPT_V3_unfiltered_cleaned_split.json 1000 215196 413539 198343 48.12 19901.29 9545.12

this branch

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
ShareGPT_V3_unfiltered_cleaned_split.json 1000 215196 413539 198343 48.57 20084.61 9633.05
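For the record, the reproducibility fix amounts to seeding every RNG the samplers rely on before any sampling happens; the flag name may differ in the actual scripts (e.g. a --seed argument), but the idea is simply:

import random
import numpy as np

SEED = 0
random.seed(SEED)      # used by e.g. random.shuffle() in ShareGPT sampling
np.random.seed(SEED)   # used by NumPy-based random prompt generation

# With both RNGs pinned, main and this branch draw the same ShareGPT subset,
# which is why the token counts in the tables above now match exactly.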

JenZhao avatar Mar 09 '25 10:03 JenZhao