
[Feature] Consolidate performance benchmark datasets

JenZhao opened this issue 9 months ago

Addressing #13351
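For context, the consolidation can be pictured as a single dataset abstraction that both the serving and throughput benchmark scripts sample from, instead of each script carrying its own loaders. The sketch below is illustrative only: the class and method names (BenchmarkDataset, SampleRequest, sample) are assumptions made for this example, not necessarily the identifiers introduced by the PR.

import json
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRequest:
    # One benchmark request: the prompt text plus token accounting.
    prompt: str
    prompt_len: int
    expected_output_len: int
    multi_modal_data: Optional[dict] = None


class BenchmarkDataset:
    # Shared base class: every dataset exposes the same sample() signature,
    # so benchmark_serving.py and benchmark_throughput.py can stay agnostic
    # about where the prompts come from.
    def __init__(self, dataset_path: Optional[str] = None, seed: int = 0):
        self.dataset_path = dataset_path
        self.rng = random.Random(seed)  # fixed seed -> reproducible sampling

    def sample(self, tokenizer, num_requests: int) -> list[SampleRequest]:
        raise NotImplementedError


class ShareGPTDataset(BenchmarkDataset):
    # Example concrete dataset: filter ShareGPT conversations and turn the
    # first human/assistant turn into a (prompt, output length) pair.
    def sample(self, tokenizer, num_requests: int) -> list[SampleRequest]:
        with open(self.dataset_path) as f:
            data = json.load(f)
        convs = [d["conversations"] for d in data
                 if len(d.get("conversations", [])) >= 2]
        self.rng.shuffle(convs)
        requests = []
        for conv in convs:
            if len(requests) == num_requests:
                break
            prompt, completion = conv[0]["value"], conv[1]["value"]
            prompt_len = len(tokenizer(prompt).input_ids)
            output_len = len(tokenizer(completion).input_ids)
            requests.append(SampleRequest(prompt, prompt_len, output_len))
        return requests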

Benchmark Serving Results

after the change

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 100 4.15 54541
hf-vision-arena openai-chat 100 11.06 16589
hf openai-chat 100 9.73 1166
sonnet vllm 100 3.76 54541
sharegpt vllm 100 8.48 23260
random vllm 100 4.96 102400
burstgpt vllm 100 21.84 77561

before the change

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 100 4.01 54541
hf-vision-arena openai-chat 100 11.21 16589
hf openai-chat 100 9.83 1166
sonnet vllm 100 3.75 54541
sharegpt vllm 100 8.42 23260
random vllm 100 4.84 102400
burstgpt vllm 100 21.66 77561
MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
  "python3 benchmarks/benchmark_serving.py --backend openai-chat --model ${MODEL_NAME} --endpoint /v1/chat/completions --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el"
  "python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmms-lab/LLaVA-OneVision-Data --hf-split train --hf-subset \"chart2text(cauldron)\" --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sharegpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name random --num-prompts ${NUM_PROMPTS}"
  "python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name burstgpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --num-prompts ${NUM_PROMPTS}"

Benchmark Throughput Results

after the change

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 50.44 1513.07 1008.71
ShareGPT 10 1.66 605.33 378.11
sonnet 10 7.62 4960.96 1142.38
burstgpt 10 2.17 2999.05 406.72

before the change (sonnet and burstgpt are not supported)

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 51.13 1534.02 1022.68
ShareGPT 10 1.66 604.19 377.39
sonnet 10 (not supported)
burstgpt 10 (not supported)
MODEL="NousResearch/Hermes-3-Llama-3.1-8B"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --input_len 10 --output_len 20 --dataset-name random --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/vllm/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset-name sonnet --dataset benchmarks/sonnet.txt --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --dataset-name burstgpt --num-prompts $NUM_PROMPTS"

Benchmark Throughput Results - Image Support

Command copied from #9851. Since the COCO dataset is too large, I used random images here.

import numpy as np
from PIL import Image

# Replace each COCO image with a random 256x256 RGB image.
random_array = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mm_data["image"] = Image.fromarray(random_array)

after the change (1000 requests): Throughput: 15.47 requests/s, 3304.12 total tokens/s, 3048.79 output tokens/s
before the change (1000 requests): Throughput: 14.90 requests/s, 3183.03 total tokens/s, 2937.06 output tokens/s

python benchmarks/benchmark_throughput.py \
    --model mistral-community/pixtral-12b \
    --max-model-len=8192 \
    --dataset sharegpt4v_instruct_gpt4-vision_cap100k.json

LoRA request test

Commands are copied from PR #11267 (a minimal sketch of the corresponding vLLM LoRA API follows the command list below).

after the change

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT 1000 1 8 Yes No 11.66 5610.75 2742.90
ShareGPT 1000 4 8 Yes No 11.59 5575.73 2725.78
ShareGPT 1000 N/A N/A No Yes 17.42 8383.51 4098.41
ShareGPT 1000 1 8 Yes Yes 11.50 5535.98 2706.35
ShareGPT 1000 4 8 Yes Yes 11.25 5412.76 2646.11

before the change

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT 1000 1 8 Yes No 10.84 5216.17 2550.01
ShareGPT 1000 4 8 Yes No 10.80 5197.68 2540.97
ShareGPT 1000 N/A N/A No Yes 16.75 8061.23 3940.86
ShareGPT 1000 1 8 Yes Yes 11.08 5332.47 2606.86
ShareGPT 1000 4 8 Yes Yes 10.84 5215.25 2549.56
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine"
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""

RandomDataSet throughput test

after the change: this now uses the random sampling defined in benchmark_serving.py (a rough sketch of that sampling scheme follows the command below)

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s Range Ratio Prefix Len Input Len Output Len
random 10 51.92 1277.24 752.85 0.5 2 10 20
random 10 39.35 1188.51 830.39 0.5 2 10 30
random 10 51.20 1689.73 834.62 0.5 2 20 20
random 10 36.60 1463.81 786.80 0.5 2 20 30
random 10 51.89 1660.44 1037.78 1.0 2 10 20

before the change: this is the original random sampling defined in benchmark_throughput.py. Range ratio and prefix len are not used by the original throughput test's random sampling.

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s Range Ratio Prefix Len Input Len Output Len
random 10 53.07 1592.00 1061.33 N/A N/A 10 20
random 10 37.15 1486.08 1114.56 N/A N/A 10 30
random 10 51.30 2052.04 1026.02 N/A N/A 20 20
random 10 37.23 1861.38 1116.83 N/A N/A 20 30
# parameters are defined in the table above
VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model NousResearch/Hermes-3-Llama-3.1-8B --dataset-name random --num-prompts 10 --prefix-len 2 --random-range-ratio 1.0 --input-len 10 --output-len 20
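For reference, here is a rough sketch of what serving-style random sampling with a shared prefix and a range ratio looks like. The helper name and exact length distribution are assumptions; the real logic lives in benchmark_serving.py and may differ in the details.

import numpy as np


def sample_random_requests(tokenizer, num_prompts, input_len, output_len,
                           range_ratio, prefix_len, seed=0):
    rng = np.random.default_rng(seed)
    vocab_size = tokenizer.vocab_size

    # A prefix of prefix_len random tokens is shared by every request.
    prefix_ids = rng.integers(0, vocab_size, size=prefix_len).tolist()

    # Per-request lengths are drawn uniformly from [range_ratio * len, len];
    # range_ratio == 1.0 therefore means fixed lengths.
    input_lens = rng.integers(int(input_len * range_ratio), input_len + 1,
                              size=num_prompts)
    output_lens = rng.integers(int(output_len * range_ratio), output_len + 1,
                               size=num_prompts)

    requests = []
    for in_len, out_len in zip(input_lens, output_lens):
        body_ids = rng.integers(0, vocab_size, size=int(in_len)).tolist()
        prompt = tokenizer.decode(prefix_ids + body_ids)
        requests.append((prompt, prefix_len + int(in_len), int(out_len)))
    return requests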

Scripts for generating the tables above are here.

JenZhao avatar Feb 28 '25 10:02 JenZhao

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 28 '25 10:02 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @JenZhao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 03 '25 01:03 mergify[bot]

Will fix LoRA and re-run the tests later.

JenZhao avatar Mar 08 '25 07:03 JenZhao

Latest testing: checking why the ShareGPT result looks different now.

benchmark_serving.py main branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.54 5409
hf-vision-arena openai-chat 10 2.60 7191
hf openai-chat 10 1.52 115
sonnet vllm 10 1.53 5409
sharegpt vllm 10 6.76 1374
random vllm 10 1.49 10240
burstgpt vllm 10 5.93 11970

benchmark_serving.py latest change

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.49 5409
hf-vision-arena openai-chat 10 2.26 7191
hf openai-chat 10 1.26 115
sonnet vllm 10 1.50 5409
sharegpt vllm 10 1.14 1960
random vllm 10 1.45 10240
burstgpt vllm 10 5.84 11970

benchmark_throughput.py main branch

The main branch only supports the random and ShareGPT datasets, and its random dataset generation differs from this branch's.

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.83 7867.60 874.18
ShareGPT_V3_unfiltered_cleaned_split.json 10 1.60 725.71 345.22
sonnet 10 (not supported)
burstgpt 10 (not supported)

benchmark_throughput.py latest change

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.33 7297.28 810.81
ShareGPT_V3_unfiltered_cleaned_split.json 10 3.25 1639.56 1152.57
sonnet 10 6.91 4557.63 1035.98
burstgpt 10 2.06 2853.16 386.93

JenZhao avatar Mar 09 '25 03:03 JenZhao

testing again

Throughput Benchmark Results, this branch

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.80 7834.99 870.55
ShareGPT_V3_unfiltered_cleaned_split.json 10 3.44 883.61 434.41
sonnet 10 6.85 4496.04 1027.27
burstgpt 10 2.10 2906.03 394.10

Throughput Benchmark Results, main branch

  • The main branch does not support Sonnet and BurstGPT.
  • The main branch’s random dataset definition is different from that of this branch; this branch uses the random dataset definition in the serving script.
  • This branch also uses the serving script's ShareGPT sampling, so there are some very minor differences.
Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 10 6.83 7867.60 874.18
ShareGPT_V3_unfiltered_cleaned_split.json 10 1.60 725.71 345.22
sonnet 10 (not supported)
burstgpt 10 (not supported)

Serving Benchmark Results, this branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.53 5409
hf-vision-arena openai-chat 10 2.61 7191
hf openai-chat 10 1.53 115
sonnet vllm 10 1.54 5409
sharegpt vllm 10 6.53 1374
random vllm 10 1.48 10240
burstgpt vllm 10 5.92 11970

Serving Benchmark Results, main branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 10 1.54 5409
hf-vision-arena openai-chat 10 2.60 7191
hf openai-chat 10 1.52 115
sonnet vllm 10 1.53 5409
sharegpt vllm 10 6.76 1374
random vllm 10 1.49 10240
burstgpt vllm 10 5.93 11970

Lora Benchmark Results, this branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 10 1 8 Yes No 1.72 705.30 327.45
ShareGPT_V... 10 4 8 Yes No 1.34 663.50 230.17
ShareGPT_V... 10 N/A N/A No Yes 1.94 1034.53 365.42
ShareGPT_V... 10 1 8 Yes Yes 2.40 1014.93 379.04
ShareGPT_V... 10 4 8 Yes Yes 0.99 633.56 359.78

Lora Benchmark Results, main branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 10 1 8 Yes No 1.65 819.24 467.69
ShareGPT_V... 10 4 8 Yes No 1.41 736.43 240.00
ShareGPT_V... 10 N/A N/A No Yes 1.77 773.02 375.51
ShareGPT_V... 10 1 8 Yes Yes 1.85 736.69 443.24
ShareGPT_V... 10 4 8 Yes Yes 1.76 729.75 240.61

JenZhao avatar Mar 09 '25 05:03 JenZhao

Testing with 1000 requests.

Lora Benchmark Results, this branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 1000 1 8 Yes No 12.05 5755.06 2816.05
ShareGPT_V... 1000 4 8 Yes No 10.85 5351.16 2539.97
ShareGPT_V... 1000 N/A N/A No Yes 16.52 8256.40 3949.34
ShareGPT_V... 1000 1 8 Yes Yes 11.93 5582.34 2611.23
ShareGPT_V... 1000 4 8 Yes Yes 11.21 5364.53 2653.42

Lora Benchmark Results, main branch

Dataset Num Prompts Max Loras Max Lora Rank Enable Lora Async Engine Throughput (requests/s) Total tokens/s Output tokens/s
ShareGPT_V... 1000 1 8 Yes No 11.42 5434.65 2654.52
ShareGPT_V... 1000 4 8 Yes No 10.87 5368.56 2520.88
ShareGPT_V... 1000 N/A N/A No Yes 15.34 7528.82 3769.25
ShareGPT_V... 1000 1 8 Yes Yes 11.60 5719.71 2573.52
ShareGPT_V... 1000 4 8 Yes Yes 10.96 5139.18 2462.06

JenZhao avatar Mar 09 '25 05:03 JenZhao

Testing with 1000 requests.

Throughput Results, this branch

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 1000 25.18 29006.31 3222.92
ShareGPT_V3_unfiltered_cleaned_split.json 1000 39.72 16360.76 7468.34
sonnet 1000 50.81 33423.33 7622.20
burstgpt 1000 14.06 15629.87 4817.21

Throughput Results, main branch

Dataset Processed Prompts Throughput (requests/s) Total tokens/s Output tokens/s
random 1000 26.14 30113.14 3345.90
ShareGPT_V3_unfiltered_cleaned_split.json 1000 37.88 15570.40 7188.78
sonnet 1000 (not supported)
burstgpt 1000 (not supported)

JenZhao avatar Mar 09 '25 06:03 JenZhao

Serving Results, this branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 1000 30.81 546875
hf-vision-arena openai-chat 500 63.45 33418
hf openai-chat 1000 90.91 11428
sonnet vllm 1000 30.43 546875
sharegpt vllm 1000 34.47 217393
random vllm 1000 42.87 1024000
burstgpt vllm 1000 101.00 768960

Serving Results, main branch

Dataset Backend Successful requests Benchmark duration (s) Total input tokens
sonnet openai-chat 1000 32.32 546875
hf-vision-arena openai-chat 500 64.04 33418
hf openai-chat 1000 478.90 11428
sonnet vllm 1000 37.38 546875
sharegpt vllm 1000 37.34 217393
random vllm 1000 59.84 1024000
burstgpt vllm 1000 109.34 768960

JenZhao avatar Mar 09 '25 07:03 JenZhao

Throughput Results, this branch

ShareGPT does not match; will look into this later.

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
random 10 10240 11520 1280 6.74 7765.54 862.84
ShareGPT_V3_unfiltered_cleaned_split.json 10 1798 3710 1912 2.75 1021.74 526.57
sonnet 10 5089 6589 1500 6.93 4563.60 1038.91
burstgpt 10 11970 13848 1878 2.05 2839.61 385.09

Throughput Results, main branch

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
random 10 10240 11520 1280 6.85 7896.75 877.42
ShareGPT_V3_unfiltered_cleaned_split.json 10 2474 3751 1277 2.89 1085.52 369.56
sonnet 10 (not supported)
burstgpt 10 (not supported)

JenZhao avatar Mar 09 '25 08:03 JenZhao


We should try to find out why sampling for ShareGPT is different between main and this branch, since this is actually quite important. Also can you check for 1000 requests?

ywang96 avatar Mar 09 '25 08:03 ywang96


OK, they now match after setting the same random seed. A short sketch of the seeding follows the tables below.

main

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
ShareGPT_V3_unfiltered_cleaned_split.json 1000 215196 413539 198343 48.12 19901.29 9545.12

this branch

Dataset Processed Prompts Total Prompt Tokens Total Tokens Total Output Tokens Requests/s Total Tokens/s Output Tokens/s
ShareGPT_V3_unfiltered_cleaned_split.json 1000 215196 413539 198343 48.57 20084.61 9633.05
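For the record, the reproducibility fix amounts to seeding every RNG the samplers rely on before any sampling happens; the flag name may differ in the actual scripts (e.g. a --seed argument), but the idea is simply:

import random
import numpy as np

SEED = 0
random.seed(SEED)      # used by e.g. random.shuffle() in ShareGPT sampling
np.random.seed(SEED)   # used by NumPy-based random prompt generation

# With both RNGs pinned, main and this branch draw the same ShareGPT subset,
# which is why the token counts in the tables above now match exactly.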

JenZhao avatar Mar 09 '25 10:03 JenZhao