
tensorrtllm and vllm backend results are different using genai-perf

Open upskyy opened this issue 1 year ago • 1 comment

Thank you for releasing a great project.

I benchmarked the rtzr/ko-gemma-2-9b-it model (a gemma-2-9b-it fine-tune) with genai-perf against both the tritonserver vllm backend and the tritonserver tensorrt_llm backend. However, the two backends report different Output sequence length metrics, which I believe is why the Output token throughput (per sec) also differs.

Since --output-tokens-mean was set to 100, vllm reports an output sequence length of 100, but tensorrtllm appears to report the input sequence length plus 100 (i.e. around 300).

I ran genai-perf in nvcr.io/nvidia/tritonserver:24.07-py3-sdk docker.

Please let me know if I misconfigured something or if anything needs to be corrected. I'm attaching the scripts and the results.

  • tensorrtllm
genai-perf -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001


concurrency: 1

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 1,606.41 │ 1,593.64 │ 1,617.31 │ 1,617.19 │ 1,616.06 │ 1,610.55 │
│ Output sequence length │   299.50 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 186.43
Request throughput (per sec): 0.62
2024-09-04 09:48 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.json
2024-09-04 09:48 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv




concurrency: 4
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 1,781.95 │ 1,740.25 │ 2,142.17 │ 2,103.83 │ 1,777.44 │ 1,765.17 │
│ Output sequence length │   299.77 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.84 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 649.28
Request throughput (per sec): 2.17
2024-09-04 09:51 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.json
2024-09-04 09:51 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.csv



concurrency: 8
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 2,091.10 │ 1,970.12 │ 2,943.90 │ 2,881.30 │ 2,313.61 │ 2,029.94 │
│ Output sequence length │   299.64 │   297.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.90 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 1054.81
Request throughput (per sec): 3.52
2024-09-04 09:53 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.json
2024-09-04 09:53 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.csv
  • vllm
genai-perf -m rtzr_gemma2 \
  --service-kind triton \
  --backend vllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:18001


concurrency: 1

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 3,792.74 │ 3,781.30 │ 3,812.85 │ 3,812.27 │ 3,807.09 │ 3,798.46 │
│ Output sequence length │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 26.37
Request throughput (per sec): 0.26
2024-09-05 04:01 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.json
2024-09-05 04:01 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.csv



concurrency: 4

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 3,996.60 │ 3,990.91 │ 4,007.69 │ 4,007.69 │ 4,007.66 │ 4,007.18 │
│ Output sequence length │    99.67 │    96.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 99.75
Request throughput (per sec): 1.00
2024-09-05 04:02 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.json
2024-09-05 04:02 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.csv



concurrency: 8
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 4,125.69 │ 4,090.61 │ 4,192.69 │ 4,192.68 │ 4,192.45 │ 4,191.99 │
│ Output sequence length │    99.92 │    98.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   199.88 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 193.71
Request throughput (per sec): 1.94
2024-09-05 04:04 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.json
2024-09-05 04:04 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.csv
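A quick sanity check on the numbers above (a sketch; the constants are copied from the two concurrency-1 tables, and the ~300-token output sequence length for TensorRT-LLM is consistent with the 200 input tokens being counted on top of the requested 100 output tokens):

```python
def approx_token_throughput(avg_osl: float, req_per_sec: float) -> float:
    """Estimate output token throughput as avg output sequence length
    times request throughput, to cross-check the reported metrics."""
    return avg_osl * req_per_sec

# TensorRT-LLM: OSL ~300 because the 200 echoed input tokens are counted.
trtllm = approx_token_throughput(299.50, 0.62)   # ~185.7 vs reported 186.43
# vLLM: OSL = 100, matching --output-tokens-mean 100.
vllm = approx_token_throughput(100.00, 0.26)     # ~26.0 vs reported 26.37

print(trtllm, vllm)
```

Both estimates land close to the reported throughputs, which suggests the ~3x gap comes entirely from the inflated output sequence length, not from a measurement error.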

upskyy avatar Sep 05 '24 04:09 upskyy

Apologies for the delayed response. For TensorRT-LLM, you need to set exclude_input_in_output to true in the model config so that the input tokens are not echoed back in the output.

There was a limitation in TensorRT-LLM that prevented GenAI-Perf from setting this value automatically. That limitation might have been lifted recently. We have it in our queue to investigate whether GenAI-Perf can now take care of this for you.
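For reference, the parameter can be set in the tensorrt_llm model's config.pbtxt (a sketch; exact placement may vary across TensorRT-LLM backend versions):

```
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```

With this set, the backend returns only the generated tokens, so the Output sequence length should match --output-tokens-mean as it does for vLLM.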

the-david-oy avatar Nov 05 '24 16:11 the-david-oy