Wide spread in metrics (RPM, E2EL) compared across other benchmarking tools
I benchmarked the same model with different tools: llmperf, genai-perf, and vllm/benchmarks, and got different RPM results from each of them.
The spread, especially at 50 concurrent requests, is very large, and it does not seem to decrease afterwards, which looks strange.
The Qwen2.5-72b-awq model was running on my own server.
Genai-perf: Request Throughput (per sec) = 2.83, i.e. 2.83 * 60 ≈ 170 RPM
LLMPerf: "results_num_completed_requests_per_min": 95.78302671905037
vllm/benchmarks: Request throughput (req/s) = 2.30, i.e. 2.30 * 60 ≈ 138 RPM
The sonnet dataset was used, as it ships with the tools: input tokens = 300, output tokens = 200, stddev = 0, duration_sec = 60, MAX_NUM_COMPLETED_REQUESTS = 600.
# vllm, DATASET_NAME=sonnet
python benchmark_serving.py \
--backend openai-chat \
--model "${MODEL}" \
--host ${LLM_HOST} \
--port ${LLM_PORT} \
--endpoint /v1/chat/completions \
--dataset-name ${DATASET_NAME} \
--dataset-path ./sonnet.txt \
--max-concurrency 50 \
--save-result \
--save-detailed \
--result-dir "${OUTPUT_DIR}/${folder}" \
--percentile-metrics ttft,tpot,itl,e2el \
--metric-percentiles "50,90,95,99" \
--${DATASET_NAME}-input-len $INPUT_SEQUENCE_LENGTH \
--${DATASET_NAME}-output-len $OUTPUT_SEQUENCE_LENGTH \
--num-prompts ${MAX_NUM_COMPLETED_REQUESTS} \
--ignore-eos \
--goodput e2el:${DURATION_MSEC}
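For reference, this is how I read the vllm number (a simplified sketch of my assumption, not the actual benchmark_serving.py code): completed requests divided by the wall clock of the entire run, including the ramp-up to --max-concurrency.
# My assumption about how "Request throughput (req/s)" is produced in
# vllm/benchmarks: completed requests over the wall clock of the whole run.
def vllm_request_throughput(num_completed: int, benchmark_duration_s: float) -> float:
    return num_completed / benchmark_duration_s

# If that is right, 2.30 req/s with 600 prompts implies a wall clock of
# roughly 600 / 2.30 ≈ 261 s.
print(vllm_request_throughput(600, 261.0) * 60)  # ≈ 138 RPM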
# llmperf
python token_benchmark_ray.py \
--model "${MODEL}" \
--mean-input-tokens ${INPUT_SEQUENCE_LENGTH} --stddev-input-tokens ${STDDEV} \
--mean-output-tokens ${OUTPUT_SEQUENCE_LENGTH} --stddev-output-tokens ${STDDEV} \
--max-num-completed-requests ${MAX_NUM_COMPLETED_REQUESTS} \
--num-concurrent-requests 50 \
--timeout ${DURATION_SEC} \
--results-dir "${OUTPUT_DIR}/${folder}" \
--llm-api openai \
--additional-sampling-params '{"ignore_eos": true}'
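And this is how I read llmperf's per-minute metric (again a simplified sketch of my assumption, not the real token_benchmark_ray.py code): completed requests divided by the total test duration, converted to a per-minute rate.
# My assumption about llmperf's "results_num_completed_requests_per_min":
# completed requests over the total test duration, scaled to one minute.
def llmperf_rpm(num_completed: int, test_duration_s: float) -> float:
    return num_completed / test_duration_s * 60

# If that is right, 95.78 RPM over the 60 s timeout means roughly 96
# requests actually completed before the test stopped.
print(llmperf_rpm(96, 60.0))  # 96.0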
# genai-perf, MAX_NUM_COMPLETED_REQUESTS=100
genai-perf analyze --random-seed ${seed} \
--service-kind openai --endpoint-type chat --streaming \
--url ${llm_host} -m ${model} \
--extra-inputs ignore_eos:true \
--extra-inputs max_tokens:${output_sequence_length} \
--extra-inputs min_tokens:${output_sequence_length} \
--output-tokens-mean ${output_sequence_length} --output-tokens-stddev ${stddev} \
--synthetic-input-tokens-mean ${input_sequence_length} --synthetic-input-tokens-stddev ${stddev} \
-v --measurement-interval ${duration_msec} \
--warmup-request-count 10 \
--num-dataset-entries ${MAX_NUM_COMPLETED_REQUESTS} \
--profile-export-file ${input_sequence_length}_${output_sequence_length}.json \
--sweep-type concurrency --sweep-list 50,100
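My understanding of genai-perf's throughput (simplified, my assumption rather than the actual implementation): only requests inside the measurement interval count, and the 10 warmup requests are discarded, so ramp-up time should not end up in the denominator.
# My assumption about genai-perf's "Request Throughput (per sec)":
# requests completed inside the measurement interval, divided by that
# interval; warmup requests are excluded.
def genai_perf_throughput(requests_in_window: int, measurement_interval_s: float) -> float:
    return requests_in_window / measurement_interval_s

# 2.83 req/s over a 60 s interval corresponds to ~170 requests in the window.
print(genai_perf_throughput(170, 60.0) * 60)  # 170.0 RPM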
Qwen3, without thinking mode (concurrency = 1, 3, 5, 8, 13, 21, 34, 55, 89, 144; MAX_NUM_COMPLETED_REQUESTS=100):
At the same time, the vLLM service counters show 135 requests per minute, while 143 requests were actually processed by the service during that period. llmperf reports 35 RPM at the same point. At concurrency 144, genai-perf reports 102 RPM and vLLM shows 109 in Grafana. So genai-perf seems to give the more plausible values, but I still don't understand why: I compared the formulas and the implementations, and it seems there should not be differences like this.
Formula: rate(vllm:request_success_total[$__rate_interval]) * 60
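The way I read this panel: rate() is roughly the per-second increase of the vllm:request_success_total counter averaged over $__rate_interval, and multiplying by 60 turns it into requests per minute. A toy calculation with made-up counter samples:
# Toy version of the Grafana formula: per-second counter increase over the
# interval, scaled to one minute (ignores counter resets and extrapolation).
def prometheus_style_rpm(counter_start: float, counter_end: float, interval_s: float) -> float:
    return (counter_end - counter_start) / interval_s * 60

# Hypothetical samples one minute apart: 135 successful requests in the
# window show up as 135 RPM on the panel.
print(prometheus_style_rpm(1000, 1135, 60))  # 135.0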
Can you tell me what this could be related to? How should I configure llmperf so that its results are at least roughly comparable to genai-perf's?
At the same time, response time increases for all tools, so the drop in RPM reported by llmperf seems justified. But the RPM from vllm/benchmarks does not change (if you ignore the metrics exposed by the vLLM service itself), while genai-perf's RPM, on the contrary, keeps growing. All of this is with the same prompts (sonnet) and the same input and output token sizes.
Sorry, the more representative RPM chart turned out to be the one with input/output tokens = 300/200.
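To make it concrete what I suspect (made-up numbers, and my assumptions about each tool's measurement window): if the only difference were which part of the run ends up in the denominator, the same run could already produce spreads of this size.
# Toy example (hypothetical numbers): 170 requests complete inside a 60 s
# steady-state window, plus 40 more during 30 s of ramp-up/drain.
steady_window_s = 60.0
steady_completed = 170
ramp_s = 30.0
ramp_completed = 40

# Counting only the steady-state window (how I assume genai-perf measures):
print(steady_completed / steady_window_s * 60)  # 170.0 RPM

# Counting the whole run including ramp-up/drain (how I assume llmperf and
# vllm/benchmarks measure):
print((steady_completed + ramp_completed) / (steady_window_s + ramp_s) * 60)  # 140.0 RPM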