
Wide spread of metrics (RPM, e2el) compared to other solutions

psydok opened this issue 6 months ago · 2 comments

I benchmarked the same model with different tools: llmperf, genai-perf, and vllm/benchmarks, and got different RPM results from each.

The spread, especially at 50 concurrent requests, is very large, and it does not seem to shrink at higher concurrency, which seems strange.

The Qwen2.5-72B-AWQ model was running on my own server.

- genai-perf: Request Throughput (per sec) = 2.83 → RPM = 2.83 × 60 ≈ 170
- llmperf: `"results_num_completed_requests_per_min": 95.78302671905037` → RPM ≈ 96
- vllm/benchmarks: Request throughput (req/s) = 2.30 → RPM = 2.3 × 60 ≈ 138

The sonnet dataset was used, as shipped with each tool. Parameters: input tokens = 300, output tokens = 200, stddev = 0, duration_sec = 60, MAX_NUM_COMPLETED_REQUESTS = 600.

```bash
# vllm, DATASET_NAME=sonnet
python benchmark_serving.py \
    --backend openai-chat \
    --model "${MODEL}" \
    --host ${LLM_HOST} \
    --port ${LLM_PORT} \
    --endpoint /v1/chat/completions \
    --dataset-name ${DATASET_NAME} \
    --dataset-path ./sonnet.txt \
    --max-concurrency 50 \
    --save-result \
    --save-detailed \
    --result-dir "${OUTPUT_DIR}/${folder}" \
    --percentile-metrics ttft,tpot,itl,e2el \
    --metric-percentiles "50,90,95,99" \
    --${DATASET_NAME}-input-len $INPUT_SEQUENCE_LENGTH \
    --${DATASET_NAME}-output-len $OUTPUT_SEQUENCE_LENGTH \
    --num-prompts ${MAX_NUM_COMPLETED_REQUESTS} \
    --ignore-eos \
    --goodput e2el:${DURATION_MSEC}
```
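As far as I can tell, the throughput that benchmark_serving.py prints is simply completed requests divided by the wall-clock duration of the whole run, ramp-up and final drain included. A minimal sketch of that computation (my assumption, not the actual vllm code; the duration in the example is made up):

```python
# Sketch (assumption, not the vllm source): "Request throughput (req/s)" as
# completed requests over the total wall-clock time of the run, including the
# ramp-up at the start and the drain of the last in-flight requests at the end.
def request_throughput(num_completed: int, start_s: float, end_s: float) -> float:
    return num_completed / (end_s - start_s)   # req/s; multiply by 60 for RPM

# e.g. 600 prompts finishing over a hypothetical 261 s run -> ~2.3 req/s (~138 RPM)
print(request_throughput(600, 0.0, 261.0) * 60)
```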

```bash
# llmperf
python token_benchmark_ray.py \
    --model "${MODEL}" \
    --mean-input-tokens ${INPUT_SEQUENCE_LENGTH} --stddev-input-tokens ${STDDEV} \
    --mean-output-tokens ${OUTPUT_SEQUENCE_LENGTH} --stddev-output-tokens ${STDDEV} \
    --max-num-completed-requests ${MAX_NUM_COMPLETED_REQUESTS} \
    --num-concurrent-requests 50 \
    --timeout ${DURATION_SEC} \
    --results-dir "${OUTPUT_DIR}/${folder}" \
    --llm-api openai \
    --additional-sampling-params '{"ignore_eos": true}'
```
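One detail that may matter for the llmperf figure: as far as I can tell, the run stops at whichever of --timeout or --max-num-completed-requests is hit first, and the per-minute metric divides the completed count by the total elapsed time of the run. A rough sketch of that logic (my assumption, not the actual llmperf code):

```python
# Rough sketch (assumption, not the llmperf source): the run ends on timeout OR
# on reaching max completed requests, and the reported per-minute metric is
# completed / elapsed * 60, so ramp-up and the final drain are both included.
import time

def run_benchmark(send_request, timeout_s: float, max_completed: int) -> float:
    start = time.monotonic()
    completed = 0
    while time.monotonic() - start < timeout_s and completed < max_completed:
        send_request()           # stands in for one blocking request/response cycle
        completed += 1
    elapsed = time.monotonic() - start
    return completed / elapsed * 60   # "results_num_completed_requests_per_min"
```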

```bash
# genai-perf, MAX_NUM_COMPLETED_REQUESTS=100
genai-perf analyze --random-seed ${seed} \
    --service-kind openai --endpoint-type chat --streaming \
    --url ${llm_host} -m ${model} \
    --extra-inputs ignore_eos:true \
    --extra-inputs max_tokens:${output_sequence_length} \
    --extra-inputs min_tokens:${output_sequence_length} \
    --output-tokens-mean ${output_sequence_length} --output-tokens-stddev ${stddev} \
    --synthetic-input-tokens-mean ${input_sequence_length} --synthetic-input-tokens-stddev ${stddev} \
    -v --measurement-interval ${duration_msec} \
    --warmup-request-count 10 \
    --num-dataset-entries ${MAX_NUM_COMPLETED_REQUESTS} \
    --profile-export-file ${input_sequence_length}_${output_sequence_length}.json \
    --sweep-type concurrency --sweep-list 50,100
```
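If I understand the genai-perf options correctly, --warmup-request-count plus --measurement-interval mean it reports throughput over a steady-state window after warmup, while llmperf and benchmark_serving.py divide by the total wall clock of the run. A toy sketch of why that alone can produce very different RPM numbers (hypothetical completion timestamps, not measured data):

```python
# Toy illustration (hypothetical completion timestamps, in seconds since the
# benchmark started): the same run measured over the full wall clock vs. over
# a steady-state window that excludes the warmup/ramp-up phase.
completions = [20.0 + i * 0.4 for i in range(150)]  # nothing finishes before t=20

def rpm_total(ts):
    # llmperf / benchmark_serving.py style: completed / total wall clock * 60
    return len(ts) / max(ts) * 60

def rpm_window(ts, t0, t1):
    # genai-perf style (as I understand it): only a fixed measurement window
    in_window = [t for t in ts if t0 <= t <= t1]
    return len(in_window) / (t1 - t0) * 60

print(round(rpm_total(completions)))           # ~113 RPM, ramp-up included
print(round(rpm_window(completions, 30, 70)))  # ~152 RPM, steady state only
```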

Qwen3, with thinking disabled (concurrency = 1, 3, 5, 8, 13, 21, 34, 55, 89, 144; MAX_NUM_COMPLETED_REQUESTS = 100):

Image

At the same time, the vLLM service counters show 135 requests per minute, while 143 requests were actually processed by the service in that period. llmperf reports 35 RPM for the same run. At concurrency 144, genai-perf reports 102 RPM, and vllm shows 109 in Grafana. So genai-perf seems to give the most plausible values, but I still don't understand why: I compared the formulas and implementations, and it seems there should not be such differences. Grafana formula: rate(vllm:request_success_total[$__rate_interval]) * 60
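For reference, rate(vllm:request_success_total[$__rate_interval]) * 60 is just the increase of the success counter over the window, divided by the window length in seconds, times 60. With made-up numbers close to the ones above:

```python
# rate() over a Prometheus counter = (increase in the window) / (window seconds);
# multiplying by 60 gives requests per minute. Numbers below are hypothetical,
# chosen only to show that ~143 completed requests over roughly a minute lands
# near the 135 RPM the counters report.
requests_in_window = 143
window_seconds = 63.5            # assumed window length, not taken from the run
print(requests_in_window / window_seconds * 60)   # ~135 RPM
```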

Image

Can you tell me what this could be related to? How should I configure llmperf so that its results are at least roughly comparable to genai-perf's?

psydok · Jul 02 '25 10:07

At the same time, response time increases for all the tools, so the drop in RPM reported by llmperf seems justified. But vllm's RPM does not change (unless you look at the metrics the vllm service itself exposes), while genai-perf's RPM, on the contrary, keeps growing. This is all with the same prompts (sonnet) and the same input and output token sizes.
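A back-of-the-envelope check with Little's law (steady-state throughput ≈ concurrency / mean end-to-end latency): once latency starts growing roughly in proportion to concurrency, the RPM curve has to flatten. The latencies below are made up for illustration, not measured:

```python
# Little's law sanity check: expected steady-state RPM at a given concurrency
# is roughly concurrency / mean end-to-end latency (s) * 60.
# Latencies are illustrative only, not measurements from the runs above.
latencies_s = {1: 6.0, 8: 7.0, 50: 22.0, 144: 60.0}   # mean e2e latency per concurrency

for concurrency, latency in latencies_s.items():
    expected_rpm = concurrency / latency * 60
    print(f"concurrency {concurrency:>3}: ~{expected_rpm:5.1f} RPM")
# once latency grows proportionally with concurrency, the RPM curve flattens
```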

Image

psydok · Jul 03 '25 09:07

Sorry, the more representative RPM chart turned out to be the one with 300/200 input/output tokens.

Image

psydok · Jul 20 '25 17:07