
problem with tensorrt_llm performance

Open Arnold1 opened this issue 1 year ago • 2 comments

System Info

Hi,

I generated a TensorRT-LLM engine for a Llama-based model and see that its performance is much worse than vLLM's.

I did the following:

  • compile the model with the TensorRT-LLM compiler
  • configure the Triton Inference Server model repo
    • configure in-flight batching for TensorRT-LLM
  • start the Triton Inference Server
  • benchmark to compare TensorRT-LLM with vLLM

Questions:

  • Is there a problem in the TensorRT-LLM engine build process?
  • How can I reconfigure TensorRT-LLM to get latency/throughput numbers similar to vLLM's?
  • I also tried setting max_batch_size to 1, but it doesn't really change anything. Any ideas?
  • If the answer is to use the latest tensorrt_llm package, which nvcr.io/nvidia/tritonserver image should I use for the TensorRT-LLM engine generation process, and which image for the Triton server doing inference?

Setup:

Image used to compile the engine and run Triton Inference Server: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
TensorRT-LLM version: 0.10.0 (included in the image above)
GPU: 1x NVIDIA A10 (nvidia-smi below shows an A10G)
GPU memory: 24 GB
LLM: Meta-Llama-Guard-2-8B

nvidia-smi output:

nvidia-smi
Thu Jul 11 23:51:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   49C    P0              70W / 300W |  16834MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     74156      C   tritonserver                              16824MiB |
+---------------------------------------------------------------------------------------+

Build TensorRT-LLM engine and create the Triton repo: create_trt_engine.txt

Start Triton Inference Server and the Triton model configs: start_triton_inference.txt
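
For context, this followed the standard TensorRT-LLM 0.10.0 / tensorrtllm_backend in-flight batching recipe. The exact commands are in the two attachments above; the sketch below only shows the shape of the flow and uses placeholder paths and values:

# 1. Convert the Hugging Face checkpoint to TensorRT-LLM format (placeholder paths)
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/Meta-Llama-Guard-2-8B \
    --output_dir /ckpt/llama-guard-2-8b \
    --dtype float16

# 2. Build the engine with the plugins needed for in-flight batching
trtllm-build \
    --checkpoint_dir /ckpt/llama-guard-2-8b \
    --output_dir /engines/llama-guard-2-8b \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --max_batch_size 64

# 3. Create the Triton model repo from the inflight_batcher_llm templates
mkdir -p /triton_repo
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_repo/
python3 tensorrtllm_backend/tools/fill_template.py -i /triton_repo/tensorrt_llm/config.pbtxt \
    engine_dir:/engines/llama-guard-2-8b,batching_strategy:inflight_fused_batching,decoupled_mode:true,triton_max_batch_size:64
# (preprocessing/postprocessing/ensemble config.pbtxt get tokenizer_dir and triton_max_batch_size filled the same way)

# 4. Start Triton Inference Server
tritonserver --model-repository=/triton_repo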

Benchmark of Triton inference (TensorRT-LLM):

2024/07/11 23:37:56 ============ Serving Benchmark Result ============
2024/07/11 23:37:56 Benchmark Duration (sec): 120.10
2024/07/11 23:37:56 Number of total requests: 362
2024/07/11 23:37:56 Success Rate (Percent): 100.00
2024/07/11 23:37:56 Concurrency: 1
2024/07/11 23:37:56 Request throughput (req/sec): 3.014
2024/07/11 23:37:56 Prompt throughput (tokens/second) avg: 2224.493
2024/07/11 23:37:56 Generation throughput (tokens/second) avg: 12.057
2024/07/11 23:37:56 End to End Latency (ms) avg: 328.207
2024/07/11 23:37:56 End to End Latency (ms) p50: 328.000
2024/07/11 23:37:56 End to End Latency (ms) p90: 329.000

2024/07/11 23:37:56 Running load test with concurrency 5...
2024/07/11 23:39:57 ============ Serving Benchmark Result ============
2024/07/11 23:39:57 Benchmark Duration (sec): 121.45
2024/07/11 23:39:57 Number of total requests: 375
2024/07/11 23:39:57 Success Rate (Percent): 100.00
2024/07/11 23:39:57 Concurrency: 5
2024/07/11 23:39:57 Request throughput (req/sec): 3.088
2024/07/11 23:39:57 Prompt throughput (tokens/second) avg: 2278.757
2024/07/11 23:39:57 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:39:57 End to End Latency (ms) avg: 1607.147
2024/07/11 23:39:57 End to End Latency (ms) p50: 1616.000
2024/07/11 23:39:57 End to End Latency (ms) p90: 1617.000

2024/07/11 23:39:57 Running load test with concurrency 10...
2024/07/11 23:42:00 ============ Serving Benchmark Result ============
2024/07/11 23:42:00 Benchmark Duration (sec): 123.06
2024/07/11 23:42:00 Number of total requests: 380
2024/07/11 23:42:00 Success Rate (Percent): 100.00
2024/07/11 23:42:00 Concurrency: 10
2024/07/11 23:42:00 Request throughput (req/sec): 3.088
2024/07/11 23:42:00 Prompt throughput (tokens/second) avg: 2278.813
2024/07/11 23:42:00 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:42:00 End to End Latency (ms) avg: 3196.500
2024/07/11 23:42:00 End to End Latency (ms) p50: 3235.000
2024/07/11 23:42:00 End to End Latency (ms) p90: 3236.000

2024/07/11 23:42:00 Running load test with concurrency 20...
2024/07/11 23:44:07 ============ Serving Benchmark Result ============
2024/07/11 23:44:07 Benchmark Duration (sec): 126.30
2024/07/11 23:44:07 Number of total requests: 390
2024/07/11 23:44:07 Success Rate (Percent): 100.00
2024/07/11 23:44:07 Concurrency: 20
2024/07/11 23:44:07 Request throughput (req/sec): 3.088
2024/07/11 23:44:07 Prompt throughput (tokens/second) avg: 2278.796
2024/07/11 23:44:07 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:44:07 End to End Latency (ms) avg: 6315.615
2024/07/11 23:44:07 End to End Latency (ms) p50: 6473.000
2024/07/11 23:44:07 End to End Latency (ms) p90: 6474.000

2024/07/11 23:44:07 Running load test with concurrency 30...
2024/07/11 23:46:16 ============ Serving Benchmark Result ============
2024/07/11 23:46:16 Benchmark Duration (sec): 129.54
2024/07/11 23:46:16 Number of total requests: 400
2024/07/11 23:46:16 Success Rate (Percent): 100.00
2024/07/11 23:46:16 Concurrency: 30
2024/07/11 23:46:16 Request throughput (req/sec): 3.088
2024/07/11 23:46:16 Prompt throughput (tokens/second) avg: 2278.771
2024/07/11 23:46:16 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:46:16 End to End Latency (ms) avg: 9359.320
2024/07/11 23:46:16 End to End Latency (ms) p50: 9712.000
2024/07/11 23:46:16 End to End Latency (ms) p90: 9713.000

2024/07/11 23:46:16 Running load test with concurrency 40...
2024/07/11 23:48:29 ============ Serving Benchmark Result ============
2024/07/11 23:48:29 Benchmark Duration (sec): 132.79
2024/07/11 23:48:29 Number of total requests: 410
2024/07/11 23:48:29 Success Rate (Percent): 100.00
2024/07/11 23:48:29 Concurrency: 40
2024/07/11 23:48:29 Request throughput (req/sec): 3.088
2024/07/11 23:48:29 Prompt throughput (tokens/second) avg: 2278.700
2024/07/11 23:48:29 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:48:29 End to End Latency (ms) avg: 12334.346
2024/07/11 23:48:29 End to End Latency (ms) p50: 12950.000
2024/07/11 23:48:29 End to End Latency (ms) p90: 12951.000

2024/07/11 23:48:29 Running load test with concurrency 50...
2024/07/11 23:50:45 ============ Serving Benchmark Result ============
2024/07/11 23:50:45 Benchmark Duration (sec): 136.02
2024/07/11 23:50:45 Number of total requests: 420
2024/07/11 23:50:45 Success Rate (Percent): 100.00
2024/07/11 23:50:45 Concurrency: 50
2024/07/11 23:50:45 Request throughput (req/sec): 3.088
2024/07/11 23:50:45 Prompt throughput (tokens/second) avg: 2278.776
2024/07/11 23:50:45 Generation throughput (tokens/second) avg: 12.351
2024/07/11 23:50:45 End to End Latency (ms) avg: 15243.845
2024/07/11 23:50:45 End to End Latency (ms) p50: 16188.000
2024/07/11 23:50:45 End to End Latency (ms) p90: 16189.000

Deploy vLLM container (deploy.sh):

#!/bin/bash

MODEL="meta-llama/Meta-Llama-Guard-2-8B"
HUGGING_FACE_HUB_TOKEN="xxx"

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model ${MODEL} \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --max-num-batched-tokens 8192

Start the vLLM container:

./deploy.sh 
Unable to find image 'vllm/vllm-openai:latest' locally
latest: Pulling from vllm/vllm-openai
3c645031de29: Pull complete 
0d6448aff889: Pull complete 
0a7674e3e8fe: Pull complete 
b71b637b97c5: Pull complete 
56dc85502937: Pull complete 
380ca03515b9: Pull complete 
b9e353cd3958: Pull complete 
57efca880186: Pull complete 
2735a04f6870: Pull complete 
175b4b06144d: Pull complete 
5dc5ca7a92cf: Pull complete 
203e66f482bf: Pull complete 
Digest: sha256:e58fceffa6f8d3e4d535f9e7128361cd33469b232a8dc670967b62ae62bac5fe
Status: Downloaded newer image for vllm/vllm-openai:latest
INFO 07-12 00:08:12 api_server.py:206] vLLM API server version 0.5.1
INFO 07-12 00:08:12 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-Guard-2-8B', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-12 00:08:12 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='meta-llama/Meta-Llama-Guard-2-8B', speculative_config=None, tokenizer='meta-llama/Meta-Llama-Guard-2-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-Guard-2-8B, use_v2_block_manager=False, enable_prefix_caching=True)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-12 00:08:13 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-12 00:10:15 model_runner.py:255] Loading model weights took 14.9595 GB
INFO 07-12 00:10:17 gpu_executor.py:84] # GPU blocks: 1639, # CPU blocks: 2048
INFO 07-12 00:10:18 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-12 00:10:18 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-12 00:10:31 model_runner.py:1117] Graph capturing finished in 12 secs.
INFO 07-12 00:10:31 block_manager_v1.py:247] Automatic prefix caching is enabled.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-12 00:10:31 serving_chat.py:94] Using default chat template:
INFO 07-12 00:10:31 serving_chat.py:94] {% if messages|length % 2 == 0 %}{% set role = 'Agent' %}{% else %}{% set role = 'User' %}{% endif %}{{ "[INST] Task: Check if there is unsafe content in '" + role + "' messages in conversations according our safety policy with the below categories.
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] <BEGIN UNSAFE CONTENT CATEGORIES>
INFO 07-12 00:10:31 serving_chat.py:94] S1: Violent Crimes. 
INFO 07-12 00:10:31 serving_chat.py:94] S2: Non-Violent Crimes. 
INFO 07-12 00:10:31 serving_chat.py:94] S3: Sex Crimes. 
INFO 07-12 00:10:31 serving_chat.py:94] S4: Child Exploitation. 
INFO 07-12 00:10:31 serving_chat.py:94] S5: Specialized Advice. 
INFO 07-12 00:10:31 serving_chat.py:94] S6: Privacy. 
INFO 07-12 00:10:31 serving_chat.py:94] S7: Intellectual Property. 
INFO 07-12 00:10:31 serving_chat.py:94] S8: Indiscriminate Weapons. 
INFO 07-12 00:10:31 serving_chat.py:94] S9: Hate. 
INFO 07-12 00:10:31 serving_chat.py:94] S10: Self-Harm. 
INFO 07-12 00:10:31 serving_chat.py:94] S11: Sexual Content. 
INFO 07-12 00:10:31 serving_chat.py:94] <END UNSAFE CONTENT CATEGORIES>
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] <BEGIN CONVERSATION>
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] " }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{% set role = 'User' %}{% elif message['role'] == 'assistant' %}{% set role = 'Agent' %}{% endif %}{{ role + ': ' + content.strip() + '
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] ' }}{% endfor %}{{ "<END CONVERSATION>
INFO 07-12 00:10:31 serving_chat.py:94] 
INFO 07-12 00:10:31 serving_chat.py:94] Provide your safety assessment for ONLY THE LAST " + role + " message in the above conversation:
INFO 07-12 00:10:31 serving_chat.py:94]  - First line must read 'safe' or 'unsafe'.
INFO 07-12 00:10:31 serving_chat.py:94]  - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]" }}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-12 00:10:32 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Benchmark of vLLM:

2024/07/12 02:23:54 Running load test with concurrency 1...
2024/07/12 02:25:54 ============ Serving Benchmark Result ============
2024/07/12 02:25:54 Benchmark Duration (sec): 120.18
2024/07/12 02:25:54 Number of total requests: 656
2024/07/12 02:25:54 Success Rate (Percent): 100.00
2024/07/12 02:25:54 Concurrency: 1
2024/07/12 02:25:54 Request throughput (req/sec): 5.459
2024/07/12 02:25:54 Prompt throughput (tokens/second) avg: 32.751
2024/07/12 02:25:54 Generation throughput (tokens/second) avg: 21.834
2024/07/12 02:25:54 End to End Latency (ms) avg: 182.445
2024/07/12 02:25:54 End to End Latency (ms) p50: 182.000
2024/07/12 02:25:54 End to End Latency (ms) p90: 183.000

2024/07/12 02:25:54 Running load test with concurrency 5...
2024/07/12 02:27:55 ============ Serving Benchmark Result ============
2024/07/12 02:27:55 Benchmark Duration (sec): 120.07
2024/07/12 02:27:55 Number of total requests: 2425
2024/07/12 02:27:55 Success Rate (Percent): 100.00
2024/07/12 02:27:55 Concurrency: 5
2024/07/12 02:27:55 Request throughput (req/sec): 20.196
2024/07/12 02:27:55 Prompt throughput (tokens/second) avg: 121.179
2024/07/12 02:27:55 Generation throughput (tokens/second) avg: 80.786
2024/07/12 02:27:55 End to End Latency (ms) avg: 246.920
2024/07/12 02:27:55 End to End Latency (ms) p50: 247.000
2024/07/12 02:27:55 End to End Latency (ms) p90: 250.000

2024/07/12 02:27:55 Running load test with concurrency 10...
2024/07/12 02:29:55 ============ Serving Benchmark Result ============
2024/07/12 02:29:55 Benchmark Duration (sec): 120.24
2024/07/12 02:29:55 Number of total requests: 4340
2024/07/12 02:29:55 Success Rate (Percent): 100.00
2024/07/12 02:29:55 Concurrency: 10
2024/07/12 02:29:55 Request throughput (req/sec): 36.096
2024/07/12 02:29:55 Prompt throughput (tokens/second) avg: 216.573
2024/07/12 02:29:55 Generation throughput (tokens/second) avg: 144.382
2024/07/12 02:29:55 End to End Latency (ms) avg: 276.402
2024/07/12 02:29:55 End to End Latency (ms) p50: 275.000
2024/07/12 02:29:55 End to End Latency (ms) p90: 282.000

2024/07/12 02:29:55 Running load test with concurrency 20...
2024/07/12 02:31:55 ============ Serving Benchmark Result ============
2024/07/12 02:31:55 Benchmark Duration (sec): 120.01
2024/07/12 02:31:55 Number of total requests: 5760
2024/07/12 02:31:55 Success Rate (Percent): 100.00
2024/07/12 02:31:55 Concurrency: 20
2024/07/12 02:31:55 Request throughput (req/sec): 47.998
2024/07/12 02:31:55 Prompt throughput (tokens/second) avg: 287.985
2024/07/12 02:31:55 Generation throughput (tokens/second) avg: 191.990
2024/07/12 02:31:55 End to End Latency (ms) avg: 416.056
2024/07/12 02:31:55 End to End Latency (ms) p50: 353.000
2024/07/12 02:31:55 End to End Latency (ms) p90: 668.000

2024/07/12 02:31:55 Running load test with concurrency 30...
2024/07/12 02:33:55 ============ Serving Benchmark Result ============
2024/07/12 02:33:55 Benchmark Duration (sec): 120.15
2024/07/12 02:33:55 Number of total requests: 7170
2024/07/12 02:33:55 Success Rate (Percent): 100.00
2024/07/12 02:33:55 Concurrency: 30
2024/07/12 02:33:55 Request throughput (req/sec): 59.675
2024/07/12 02:33:55 Prompt throughput (tokens/second) avg: 358.051
2024/07/12 02:33:55 Generation throughput (tokens/second) avg: 238.701
2024/07/12 02:33:55 End to End Latency (ms) avg: 502.089
2024/07/12 02:33:55 End to End Latency (ms) p50: 474.000
2024/07/12 02:33:55 End to End Latency (ms) p90: 502.000

2024/07/12 02:33:55 Running load test with concurrency 40...
2024/07/12 02:35:55 ============ Serving Benchmark Result ============
2024/07/12 02:35:55 Benchmark Duration (sec): 120.51
2024/07/12 02:35:55 Number of total requests: 8280
2024/07/12 02:35:55 Success Rate (Percent): 100.00
2024/07/12 02:35:55 Concurrency: 40
2024/07/12 02:35:55 Request throughput (req/sec): 68.707
2024/07/12 02:35:55 Prompt throughput (tokens/second) avg: 412.240
2024/07/12 02:35:55 Generation throughput (tokens/second) avg: 274.827
2024/07/12 02:35:55 End to End Latency (ms) avg: 581.544
2024/07/12 02:35:55 End to End Latency (ms) p50: 573.000
2024/07/12 02:35:55 End to End Latency (ms) p90: 586.000

2024/07/12 02:35:55 Running load test with concurrency 50...
2024/07/12 02:37:56 ============ Serving Benchmark Result ============
2024/07/12 02:37:56 Benchmark Duration (sec): 120.16
2024/07/12 02:37:56 Number of total requests: 8600
2024/07/12 02:37:56 Success Rate (Percent): 100.00
2024/07/12 02:37:56 Concurrency: 50
2024/07/12 02:37:56 Request throughput (req/sec): 71.573
2024/07/12 02:37:56 Prompt throughput (tokens/second) avg: 429.441
2024/07/12 02:37:56 Generation throughput (tokens/second) avg: 286.294
2024/07/12 02:37:56 End to End Latency (ms) avg: 697.933
2024/07/12 02:37:56 End to End Latency (ms) p50: 689.000
2024/07/12 02:37:56 End to End Latency (ms) p90: 707.000

Who can help?

@hijkzzz @Tracin @yuxianq @Njuapp @uppalutkarsh @nv-guomingz

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Everything is in the code above.

Expected behavior

Better performance for concurrent requests and performance similar to vLLM.

actual behavior

Performance degradation: request throughput stays flat at about 3 req/s regardless of concurrency, while end-to-end latency grows roughly linearly with concurrency.

additional notes

Arnold1 · Jul 12 '24

@kaiyux Could you please have a look? Thanks

QiJune · Jul 15 '24

Hi @Arnold1, how did you get the benchmark results for Triton inference and vLLM? Can you share your detailed steps so I can reproduce your results quickly and find the root cause of the gap?

sunnyqgg · Jul 17 '24

Hi @Arnold1, @sunnyqgg, were you able to figure out the root cause here? I am observing a similar trend for a Llama2-7B model, using the latest versions of both TRT-LLM and vLLM with their respective latest Triton servers.

ashwin-js · Sep 3 '24

Hi @ashwin-js, this is not expected. Can you share your steps and commands for both?

sunnyqgg · Sep 4 '24

@Arnold1 @ashwin-js If you have no further questions, we will close this issue in a week.

hello-11 · Nov 14 '24

Title: Performance Issue: TensorRT-LLM (v0.20.0/v25.06) Significantly Slower than vLLM for Single Request on Qwen-based Model

Description

When serving the casperhansen/deepseek-r1-distill-qwen-14b-awq model, I'm observing a significant performance degradation when using TensorRT-LLM with Triton compared to vLLM. A single request takes approximately 3.7 seconds with TensorRT-LLM, whereas the same request completes in just 0.4 seconds with vLLM.

The primary issue appears to be that the TensorRT-LLM response includes a long, internal "thought process" before the actual JSON output, while vLLM returns the concise JSON directly. This suggests a potential mismatch in prompt templating between the two setups.
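
A quick way to check the templating is to render the tokenizer's own chat template and compare it with what the Triton preprocessing model actually sends to the engine. This is a rough sanity check (not part of the original repro) and assumes the transformers library is available in the container:

# Render the prompt exactly as the model's chat template defines it
# (system prompt shortened here for brevity).
python3 -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('casperhansen/deepseek-r1-distill-qwen-14b-awq')
messages = [
    {'role': 'system', 'content': 'You are a multilingual vehicle information processor. ...'},
    {'role': 'user', 'content': 'A purple car is speeding with a white license plate number B12345FGH'},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
"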

Environment

  • TensorRT-LLM Container: nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3
  • GPU: 1x NVIDIA L4 24GB
  • CPU: 10 vCPU AMD EPYC 7443
  • RAM: 96 GB
  • OS: Ubuntu 24.04
  • Nvidia driver version: 565
  • Model: casperhansen/deepseek-r1-distill-qwen-14b-awq

Reproduction Steps

1. vLLM Baseline (Fast)

The vLLM server is deployed using its standard container, which demonstrates the expected performance.

# Start the vLLM OpenAI-compatible server
docker run --name vllm --gpus all -itd \
  -v ~/llm/models:/root/.cache/huggingface \
  --net=host --ipc=host \
  vllm/vllm-openai:latest \
  --model casperhansen/deepseek-r1-distill-qwen-14b-awq \
  --max_model_len 8192 \
  --host 0.0.0.0 --port 8080

2. Triton + TensorRT-LLM Setup (Slow)

The model is converted and built for TensorRT-LLM and served via Triton.

# 1. Start the TRT-LLM container
docker run --name tritonllm -it --net host --shm-size=16g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v ~/:/root/ \
    -v ~/llm/models:/root/.cache/huggingface \
    -v ~/llm/engines:/engines \
    nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3

# 2. Define environment variables inside the container
ENGINE_DIR=/engines/tllm_casperhansen-deepseek-r1-distill-qwen-14b-checkpoint_quantized_int4-awq
TOKENIZER_DIR=/root/.cache/huggingface/hub/models--casperhansen--deepseek-r1-distill-qwen-14b-awq/snapshots/bc43ec1bbf08de53452630806d5989208b4186db
MODEL_DIR=${TOKENIZER_DIR}

# 3. Convert the Hugging Face checkpoint to TRT-LLM format
python3 /app/examples/models/core/qwen/convert_checkpoint.py \
    --model_dir ${MODEL_DIR} \
    --output_dir ${ENGINE_DIR} \
    --dtype float16 \
    --tp_size 1

# 4. Build the TensorRT-LLM engine
# Note: Added suggested flags for optimization
trtllm-build \
    --checkpoint_dir ${ENGINE_DIR} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --remove_input_padding \
    --context_fmha enable \
    --paged_kv_cache enable \
    --use_inflight_batching \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096 \
    --max_num_tokens 8192

# 5. Prepare the Triton model repository
MODEL_FOLDER=/root/models_casperhansen-deepseek-r1-distill-qwen-14b-awq/
mkdir -p ${MODEL_FOLDER}   # ensure the target directory exists before copying
cp -r /app/all_models/inflight_batcher_llm/* ${MODEL_FOLDER}/

# 6. Configure the Triton model repository
python3 /app/tools/fill_template.py -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:8
python3 /app/tools/fill_template.py -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:8
python3 /app/tools/fill_template.py -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt decoupled_mode:true,engine_dir:${ENGINE_DIR},batching_strategy:inflight_fused_batching
python3 /app/tools/fill_template.py -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:8
python3 /app/tools/fill_template.py -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt decoupled_mode:true

# 7. Start the Triton server with the OpenAI frontend
python3 /opt/tritonserver/python/openai/openai_frontend/main.py --model-repository=${MODEL_FOLDER} --tokenizer=${TOKENIZER_DIR} --openai-port 8080

# Or, Start the server using trtllm-serve
trtllm-serve serve ${ENGINE_DIR} \
             --tokenizer ${TOKENIZER_DIR} \
             --host 0.0.0.0 \
             --port 8080 \
             --max_batch_size 8 \
             --log_level info \
             --max_num_tokens=300

Request Used for Testing

# "model" is "ensemble" for the Triton endpoint; for vLLM, use "casperhansen/deepseek-r1-distill-qwen-14b-awq" instead.
curl -X POST "http://127.0.0.1:8080/v1/chat/completions" \
 -H "Content-Type: application/json" \
 -d '{
    "model": "ensemble",
    "messages": [
      {
        "role": "system",
        "content": "You are a multilingual vehicle information processor. Your task:\n\n1. Check if the input text is about a vehicle or a vehicle violation.\n2. Return only true or false.\n\nExample 1: Input: \"Black car not wearing a seatbelt\", Output: {\"isVehicleQuery\": true}\nExample 2: Input: \"Fried duck\", Output: {\"isVehicleQuery\": false}"
      },
      {
        "role": "user",
        "content": "A purple car is speeding with a white license plate number B12345FGH"
      }
    ],
    "temperature": 0.1,
    "max_tokens": 100,
    "response_format": {
      "type": "json_object"
     }
  }'

Observed vs. Expected Behavior

Observed Behavior (Triton + TensorRT-LLM)

  • Latency: ~3.758 seconds.
  • Output: The model generates a long "thought process" before providing a truncated answer. This indicates a potential prompt format issue and explains the high token generation time.
{
  "id": "cmpl-a97d3b7e-5ae7-11f0-b466-31b885f1f12f",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "<think>\nAlright, so I need to figure out whether the input text is about a vehicle or a vehicle violation. The task is to return true or false based on that. Let me break this down step by step.\n\nFirst, I'll look at the examples provided to understand the pattern. In Example 1, \"Black car not wearing a seatbelt\" returns true. It mentions a vehicle (car) and a violation (not wearing a seatbelt). Similarly, Example 3 is \"Red",
        "role": "assistant"
      }
    }
  ],
  "created": 1751861147,
  "model": "ensemble",
  "object": "chat.completion"
}

Expected Behavior (vLLM)

  • Latency: ~0.393 seconds.
  • Output: The model correctly and immediately returns the expected JSON object.
{
  "id": "chatcmpl-6ca35456b7c94bff98130f8449f3b293",
  "object": "chat.completion",
  "created": 1751861420,
  "model": "casperhansen/deepseek-r1-distill-qwen-14b-awq",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"isVehicleQuery\": true }"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 335,
    "total_tokens": 344,
    "completion_tokens": 9
  }
}

Server Log

vLLM log:

user@41ace22b-ef83-4b25-b433-bb2964a07f70:~/fikri_td$ docker logs vllm 
INFO 07-06 19:59:34 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 19:59:41 [api_server.py:1287] vLLM API server version 0.9.1
INFO 07-06 19:59:42 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'port': 8080, 'model': 'casperhansen/deepseek-r1-distill-qwen-14b-awq', 'max_model_len': 8192}
config.json: 1.02kB [00:00, 1.98MB/s]
INFO 07-06 20:00:00 [config.py:823] This model supports multiple tasks: {'score', 'embed', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
tokenizer_config.json: 6.75kB [00:00, 11.3MB/s]
INFO 07-06 20:00:05 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 07-06 20:00:05 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 33.6MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 485/485 [00:00<00:00, 3.80MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164/164 [00:00<00:00, 1.22MB/s]
WARNING 07-06 20:00:09 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-06 20:00:11 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 20:00:14 [core.py:455] Waiting for init message from front-end.
INFO 07-06 20:00:14 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='casperhansen/deepseek-r1-distill-qwen-14b-awq', speculative_config=None, tokenizer='casperhansen/deepseek-r1-distill-qwen-14b-awq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=casperhansen/deepseek-r1-distill-qwen-14b-awq, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-06 20:00:15 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ff6730dd9a0>
INFO 07-06 20:00:16 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-06 20:00:16 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-06 20:00:16 [gpu_model_runner.py:1595] Starting to load model casperhansen/deepseek-r1-distill-qwen-14b-awq...
INFO 07-06 20:00:16 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-06 20:00:16 [cuda.py:252] Using Flash Attention backend on V1 engine.
INFO 07-06 20:00:16 [weight_utils.py:292] Using model weights format ['*.safetensors']
model-00001-of-00002.safetensors: 100%|██████████| 4.99G/4.99G [01:42<00:00, 48.6MB/s]
model-00002-of-00002.safetensors: 100%|██████████| 4.99G/4.99G [01:22<00:00, 60.4MB/s]
INFO 07-06 20:03:23 [weight_utils.py:308] Time spent downloading weights for casperhansen/deepseek-r1-distill-qwen-14b-awq: 186.424422 seconds
model.safetensors.index.json: 102kB [00:00, 178MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.26it/s]

INFO 07-06 20:03:25 [default_loader.py:272] Loading weights took 1.76 seconds
INFO 07-06 20:03:26 [gpu_model_runner.py:1624] Model loading took 9.4016 GiB and 189.755451 seconds
INFO 07-06 20:03:39 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/8e2e7fadc8/rank_0_0 for vLLM's torch.compile
INFO 07-06 20:03:39 [backends.py:472] Dynamo bytecode transform time: 12.71 s
INFO 07-06 20:03:44 [backends.py:161] Cache the graph of shape None for later use
INFO 07-06 20:04:41 [backends.py:173] Compiling a graph for general shape takes 60.43 s
INFO 07-06 20:05:43 [monitor.py:34] torch.compile takes 73.14 s in total
INFO 07-06 20:05:44 [gpu_worker.py:227] Available KV cache memory: 9.89 GiB
INFO 07-06 20:05:44 [kv_cache_utils.py:715] GPU KV cache size: 53,984 tokens
INFO 07-06 20:05:44 [kv_cache_utils.py:719] Maximum concurrency for 8,192 tokens per request: 6.59x
INFO 07-06 20:06:20 [gpu_model_runner.py:2048] Graph capturing finished in 36 secs, took 0.78 GiB
INFO 07-06 20:06:20 [core.py:171] init engine (profile, create kv cache, warmup model) took 174.21 seconds
INFO 07-06 20:06:21 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 3374
INFO 07-06 20:06:22 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8080
INFO 07-06 20:06:22 [launcher.py:29] Available routes are:
INFO 07-06 20:06:22 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /health, Methods: GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /load, Methods: GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /ping, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /ping, Methods: GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /version, Methods: GET
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /pooling, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /classify, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /score, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /rerank, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /invocations, Methods: POST
INFO 07-06 20:06:22 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

INFO:     127.0.0.1:58180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-06 21:10:13 [logger.py:43] Received request chatcmpl-9337ccfb955b4b7ea65b093028dc6720: prompt: '<|begin▁of▁sentence|>You are a multilingual vehicle information processor. Your task:\n\n1. Check if the input text is about a vehicle or a vehicle violation.\n2. Return only true or false.\n\nExample 1: Input: "Black car not wearing a seatbelt", Output: {"isVehicleQuery": true}\nExample 2: Input: "Fried duck", Output: {"isVehicleQuery": false}\nExample 3: Input: "Red truck", Output: {"isVehicleQuery": true}\nExample 4: Input: "Motorcycle", Output: {"isVehicleQuery": true}\nExample 5: Input: "White sedan not using a seatbelt", Output: {"isVehicleQuery": true}\nExample 6: Input: "Red car not wearing a belt", Output: {"isVehicleQuery": true}\nExample 7: Input: "Boiled potatoes", Output: {"isVehicleQuery": false}\nExample 8: Input: "Black bus", Output: {"isVehicleQuery": true}\nExample 9: Input: "Black truck", Output: {"isVehicleQuery": true}\nExample 10: Input: "Black motorcycle", Output: {"isVehicleQuery": true}\nExample 11: Input: "Black car", Output: {"isVehicleQuery": true}\nExample 12: Input: "Black bus not using a seatbelt", Output: {"isVehicleQuery": true}\nExample 13: Input: "Black truck not using a seatbelt", Output: {"isVehicleQuery": true}<|User|>A purple car is speeding with a white license plate number B12345FGH<|Assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=True, backend=None, backend_was_auto=False, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, whitespace_pattern=None, structural_tag=None), extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-06 21:10:13 [async_llm.py:271] Added request chatcmpl-9337ccfb955b4b7ea65b093028dc6720.
INFO:     127.0.0.1:50420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-06 21:10:17 [logger.py:43] Received request chatcmpl-cbcfaa16cc4c49c8af92a885177def45: prompt: '<|begin▁of▁sentence|>You are a multilingual vehicle information processor. Your task:\n\n1. Check if the input text is about a vehicle or a vehicle violation.\n2. Return only true or false.\n\nExample 1: Input: "Black car not wearing a seatbelt", Output: {"isVehicleQuery": true}\nExample 2: Input: "Fried duck", Output: {"isVehicleQuery": false}\nExample 3: Input: "Red truck", Output: {"isVehicleQuery": true}\nExample 4: Input: "Motorcycle", Output: {"isVehicleQuery": true}\nExample 5: Input: "White sedan not using a seatbelt", Output: {"isVehicleQuery": true}\nExample 6: Input: "Red car not wearing a belt", Output: {"isVehicleQuery": true}\nExample 7: Input: "Boiled potatoes", Output: {"isVehicleQuery": false}\nExample 8: Input: "Black bus", Output: {"isVehicleQuery": true}\nExample 9: Input: "Black truck", Output: {"isVehicleQuery": true}\nExample 10: Input: "Black motorcycle", Output: {"isVehicleQuery": true}\nExample 11: Input: "Black car", Output: {"isVehicleQuery": true}\nExample 12: Input: "Black bus not using a seatbelt", Output: {"isVehicleQuery": true}\nExample 13: Input: "Black truck not using a seatbelt", Output: {"isVehicleQuery": true}<|User|>A purple car is speeding with a white license plate number B12345FGH<|Assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=True, backend=None, backend_was_auto=False, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, whitespace_pattern=None, structural_tag=None), extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-06 21:10:17 [async_llm.py:271] Added request chatcmpl-cbcfaa16cc4c49c8af92a885177def45.
INFO:     127.0.0.1:50424 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-06 21:10:17 [loggers.py:118] Engine 000: Avg prompt throughput: 67.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 63.7%
INFO 07-06 21:10:20 [logger.py:43] Received request chatcmpl-6ca35456b7c94bff98130f8449f3b293: prompt: '<|begin▁of▁sentence|>You are a multilingual vehicle information processor. Your task:\n\n1. Check if the input text is about a vehicle or a vehicle violation.\n2. Return only true or false.\n\nExample 1: Input: "Black car not wearing a seatbelt", Output: {"isVehicleQuery": true}\nExample 2: Input: "Fried duck", Output: {"isVehicleQuery": false}\nExample 3: Input: "Red truck", Output: {"isVehicleQuery": true}\nExample 4: Input: "Motorcycle", Output: {"isVehicleQuery": true}\nExample 5: Input: "White sedan not using a seatbelt", Output: {"isVehicleQuery": true}\nExample 6: Input: "Red car not wearing a belt", Output: {"isVehicleQuery": true}\nExample 7: Input: "Boiled potatoes", Output: {"isVehicleQuery": false}\nExample 8: Input: "Black bus", Output: {"isVehicleQuery": true}\nExample 9: Input: "Black truck", Output: {"isVehicleQuery": true}\nExample 10: Input: "Black motorcycle", Output: {"isVehicleQuery": true}\nExample 11: Input: "Black car", Output: {"isVehicleQuery": true}\nExample 12: Input: "Black bus not using a seatbelt", Output: {"isVehicleQuery": true}\nExample 13: Input: "Black truck not using a seatbelt", Output: {"isVehicleQuery": true}<|User|>A purple car is speeding with a white license plate number B12345FGH<|Assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=True, backend=None, backend_was_auto=False, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, whitespace_pattern=None, structural_tag=None), extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-06 21:10:20 [async_llm.py:271] Added request chatcmpl-6ca35456b7c94bff98130f8449f3b293.
INFO:     127.0.0.1:32788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-06 21:10:27 [loggers.py:118] Engine 000: Avg prompt throughput: 33.5 tokens/s, Avg generation throughput: 0.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 71.6%
INFO 07-06 21:10:37 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 71.6%
INFO 07-06 21:15:09 [launcher.py:80] Shutting down FastAPI HTTP server.

Triton TensorRT-LLM log:

root@41ace22b-ef83-4b25-b433-bb2964a07f70:/opt/tritonserver# python3 /opt/tritonserver/python/openai/openai_frontend/main.py --model-repository=${MODEL_FOLDER} --tokenizer=${TOKENIZER_DIR} --backend tensorrtllm --openai-port 8080
I0707 04:16:02.583385 9623 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x73f90a000000' with size 268435456"
I0707 04:16:02.586667 9623 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0707 04:16:02.594750 9623 model_lifecycle.cc:473] "loading: postprocessing:1"
I0707 04:16:02.594826 9623 model_lifecycle.cc:473] "loading: preprocessing:1"
I0707 04:16:02.594903 9623 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I0707 04:16:02.594977 9623 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
I0707 04:16:04.922180 9623 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0707 04:16:04.922238 9623 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0707 04:16:04.922249 9623 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0707 04:16:04.922259 9623 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I0707 04:16:04.932885 9623 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][INFO] num_nodes is not specified, will be set to 1
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 48
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0707 04:16:07.206905 9623 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0707 04:16:07.207360 9623 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_1 (CPU device 0)"
I0707 04:16:08.401151 9623 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I0707 04:16:08.401459 9623 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_1 (CPU device 0)"
I0707 04:16:08.673497 9623 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0707 04:16:08.673684 9623 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_1 (CPU device 0)"
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I0707 04:16:10.525566 9623 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
I0707 04:16:10.672589 9623 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
I0707 04:16:13.457399 9623 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 9742 MiB
[TensorRT-LLM][INFO] Engine load time 10029 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 809.10 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 9729 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.30 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 19.56 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.06 GiB, available: 11.45 GiB, extraCostMemory: 0.00 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1759
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 128 [window size=4096]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 10.31 GiB for max tokens in paged KV cache (56288).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0707 04:16:15.965495 9623 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
I0707 04:16:15.965961 9623 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm'"
I0707 04:16:15.966938 9623 model_lifecycle.cc:473] "loading: ensemble:1"
I0707 04:16:15.967914 9623 model_lifecycle.cc:849] "successfully loaded 'ensemble'"
I0707 04:16:15.968054 9623 server.cc:611] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0707 04:16:15.968102 9623 server.cc:638] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0707 04:16:15.968206 9623 server.cc:681] 
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I0707 04:16:16.064446 9623 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA L4"
I0707 04:16:16.068474 9623 metrics.cc:783] "Collecting CPU metrics"
I0707 04:16:16.068788 9623 tritonserver.cc:2598] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                        |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                       |
| server_version                   | 2.59.0                                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logg |
|                                  | ing                                                                                                                                                                                                          |
| model_repository_path[0]         | /root/models_casperhansen-deepseek-r1-distill-qwen-14b-awq/                                                                                                                                                  |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                                            |
| model_config_name                |                                                                                                                                                                                                              |
| rate_limit                       | OFF                                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                     |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                                           |
| cache_enabled                    | 0                                                                                                                                                                                                            |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Found model: name='ensemble', backend='ensemble'
Found model: name='postprocessing', backend='python'
Found model: name='preprocessing', backend='python'
Found model: name='tensorrt_llm', backend='tensorrtllm'
Found model: name='tensorrt_llm_bls', backend='python'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO:     Started server process [9623]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

mifikri avatar Jul 07 '25 04:07 mifikri
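
A note on the startup log above: the repeated [TensorRT-LLM][WARNING] lines about ${skip_special_tokens}, ${add_special_tokens} and ${max_num_images} mean the ${...} placeholders in the model repository's config.pbtxt files were never substituted, so the preprocessing/postprocessing models silently fall back to their defaults. Below is a minimal sketch of how these placeholders are usually filled with the fill_template.py helper shipped in the tensorrtllm_backend repository. The repository path, tokenizer directory and the exact parameter set accepted by each config.pbtxt depend on the backend version, so treat the concrete values as assumptions rather than a definitive recipe.

# assumed locations -- adjust to your own layout
REPO=/root/models_casperhansen-deepseek-r1-distill-qwen-14b-awq
TOKENIZER_DIR=/path/to/hf/tokenizer   # hypothetical path, point this at your HF tokenizer directory

# fill the preprocessing / postprocessing templates so the ${add_special_tokens},
# ${skip_special_tokens} and ${max_num_images} warnings go away
python3 tools/fill_template.py -i ${REPO}/preprocessing/config.pbtxt \
    tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:64,preprocessing_instance_count:1,add_special_tokens:True,max_num_images:1
python3 tools/fill_template.py -i ${REPO}/postprocessing/config.pbtxt \
    tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:64,postprocessing_instance_count:1,skip_special_tokens:True

# the tensorrt_llm model config carries the runtime/batching knobs
python3 tools/fill_template.py -i ${REPO}/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,triton_max_batch_size:64,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9,max_queue_delay_microseconds:0

The kv_cache_free_gpu_mem_fraction knob in the last command is also the first thing worth checking for the latency/throughput questions in this issue: the log above shows roughly 10.31 GiB reserved for the paged KV cache out of about 11.45 GiB available, and that fraction directly controls how many requests can be kept in flight by the inflight batcher.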