Low performance at a large number of concurrent requests
Description
Low speed with a large number of concurrent requests:
| concurrent requests | 1 | 50 | 100 |
|---|---|---|---|
| TensorRT-LLM | 73.36 | 193.30 | 193.81 |
| vLLM | 64.13 | 984.55 | 1246.50 |
Values are TPS (tokens per second). The TensorRT-LLM results at 50 and 100 concurrent requests are nearly identical.
Triton Information
- TensorRT-LLM: 0.11.0
- tensorrtllm_backend: 0.11.0
- tritonserver image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
- GPU info: NVIDIA A100 80GB PCIe
To Reproduce
Build the model:
python3 examples/llama/convert_checkpoint.py --model_dir /data/Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_bf16 \
--dtype bfloat16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
--output_dir /data/trt-Meta-Llama-3-8B-Instruct \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 2048 \
--max_input_len 4096 \
--max_num_tokens 4096 \
--multiple_profiles enable \
--paged_kv_cache enable \
--use_paged_context_fmha enable
Run tritonserver:
git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
mkdir -p repo/llama3
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* repo/llama3/
cp ./trt-Meta-Llama-3-8B-Instruct/* repo/llama3/tensorrt_llm/1/
HF_LLAMA_MODEL="/data/Meta-Llama-3-8B-Instruct"
ENGINE_PATH="/data/repo/llama3/tensorrt_llm/1"
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:2048,preprocessing_instance_count:1
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:2048,postprocessing_instance_count:8
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/ensemble/config.pbtxt triton_max_batch_size:2048
python3 tensorrtllm_backend/tools/fill_template.py -i repo/llama3/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:2048,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:10000,enable_chunked_context:True,max_num_sequences:256
rm -r repo/llama3/tensorrt_llm_bls
docker run --rm -it --net host --gpus all \
--shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
-v $(pwd):/data \
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
tritonserver --model-repository=/data/repo/llama3 --backend-config=default-max-batch-size=2048
Send requests to /v2/models/ensemble/generate.
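For reference, a minimal sketch of a concurrency/TPS measurement client is shown below. This is not the exact benchmark behind the numbers above; the prompt, the `text_input`/`max_tokens`/`text_output` field names, and the whitespace-based token count are assumptions that may differ from the actual ensemble config and tokenizer.

```python
# Hypothetical load-test sketch: send CONCURRENCY requests in parallel to the
# ensemble generate endpoint and report an approximate tokens-per-second value.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v2/models/ensemble/generate"
CONCURRENCY = 100   # 1 / 50 / 100 in the tables above
MAX_TOKENS = 256    # assumed output length, not taken from the issue

def one_request(prompt: str) -> str:
    # Field names follow the default ensemble convention; adjust if your
    # config.pbtxt uses different input/output tensor names.
    payload = json.dumps({"text_input": prompt, "max_tokens": MAX_TOKENS}).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    outputs = list(pool.map(one_request, ["Tell me a short story."] * CONCURRENCY))
elapsed = time.time() - start

# Rough token count via whitespace splitting; a real benchmark would use the
# served model's tokenizer instead.
total_tokens = sum(len(text.split()) for text in outputs)
print(f"{CONCURRENCY} requests in {elapsed:.1f}s -> ~{total_tokens / elapsed:.1f} TPS")
```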
Expected behavior
The TensorRT-LLM result is expected to be faster, and throughput at 100 concurrent requests should be higher than at 50.
Would you try increasing max_num_tokens? https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html It would also be kind of you to later describe how performance depends on the value of max_num_tokens (it has some optimal value, which is likely well above 4096, but it is definitely possible to overshoot).
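A hedged sketch of such a max_num_tokens sweep is below, reusing the checkpoint and build flags from the reproduction steps above. The candidate values and the per-value output directories are illustrative only; after each rebuild, tritonserver would have to be restarted against the new engine and the concurrency benchmark rerun.

```python
# Hypothetical sweep: rebuild the engine for several max_num_tokens values so
# throughput can be compared across them. Flags mirror the trtllm-build command
# above; the candidate values and output-directory suffix are only examples.
import subprocess

CHECKPOINT_DIR = "./tllm_checkpoint_1gpu_bf16"
ENGINE_DIR = "/data/trt-Meta-Llama-3-8B-Instruct"

for max_num_tokens in (4096, 8192, 16384, 32768):
    subprocess.run([
        "trtllm-build",
        "--checkpoint_dir", CHECKPOINT_DIR,
        "--output_dir", f"{ENGINE_DIR}-mnt{max_num_tokens}",  # one engine dir per value
        "--gpt_attention_plugin", "bfloat16",
        "--gemm_plugin", "bfloat16",
        "--max_batch_size", "2048",
        "--max_input_len", "4096",
        "--max_num_tokens", str(max_num_tokens),
        "--multiple_profiles", "enable",
        "--paged_kv_cache", "enable",
        "--use_paged_context_fmha", "enable",
    ], check=True)
    # After each build: point the tensorrt_llm model's engine_dir at the new
    # engine, restart tritonserver, and rerun the concurrent-request benchmark.
```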
Tested again after increasing max_num_tokens (from 4096 to 409600).
Build the model:
python3 examples/llama/convert_checkpoint.py --model_dir /data/Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_bf16 \
--dtype bfloat16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
--output_dir /data/trt-Meta-Llama-3-8B-Instruct \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_batch_size 2048 \
--max_input_len 4096 \
--max_num_tokens 409600 \
--multiple_profiles enable \
--paged_kv_cache enable \
--use_paged_context_fmha enable
When we increased max_num_tokens, we saw that VRAM usage increased (using more KV cache):
| max_num_tokens | GPU Memory Usage |
|---|---|
| 4096 | 25590MiB |
| 409600 | 64784MiB |
This is the new TensorRT-LLM result with max_num_tokens increased to 409600. Only 100 concurrent requests were tested, by sending requests to /v2/models/ensemble/generate:
| concurrent requests | 100 |
|---|---|
| TensorRT-LLM (max_num_tokens=4096) | 193.81 |
| TensorRT-LLM (max_num_tokens=409600) | 184.27 |
| vLLM | 1246.50 |
Looking at the results, there seems to be another problem.
I am facing a similar issue comparing Triton Server with the vLLM and TRT-LLM backends, with 24.07.
One observation made with --log-verbose=1 while Triton Server is running at 100 concurrency: the Generation Requests / Scheduled Requests count is only 5. Is that alright?
I0831 18:57:08.685678 1 model_instance_state.cc:969] "{\"Active Request Count\":99,\"Iteration Counter\":392,\"Max Request Count\":256,\"Runtime CPU Memory Usage\":90260,\"Runtime GPU Memory Usage\":2045966240,\"Runtime Pinned Memory Usage\":562149636,\"Timestamp\":\"08-31-2024 18:57:08\",\"Context Requests\":0,\"Generation Requests\":5,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":5,\"Total Context Tokens\":0,\"Free KV cache blocks\":9,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":31}"
I am also observing that the client receives responses in groups of about 5, so inference is happening with only 5 requests at a time, which leads me to conclude that Triton Server is not handling concurrency correctly.
I also tried different queue delay and max batch size values in the tensorrt_llm Triton config file, and built the TRT engine with a matching max batch size, but it didn't help:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 100
model_transaction_policy {
decoupled: False
}
dynamic_batching {
max_queue_delay_microseconds: 1000000
}
parameters {
key: "max_batch_size"
value: {string_value: "100"}
}
I experimented with various dynamic_batching strategy parameters, but it doesn't help.
If the issue I am facing is totally different, I will create a new issue.
Could the tokenizer or another component of the stack be a bottleneck? Similar to https://github.com/triton-inference-server/server/issues/6894?
It looks like the same problem, @manickavela29.
I0903 07:34:54.755164 1 model_instance_state.cc:969] "{\"Active Request Count\":80,\"Iteration Counter\":14522,\"Max Request Count\":2048,\"Runtime CPU Memory Usage\":721044,\"Runtime GPU Memory Usage\":50039058456,\"Runtime Pinned Memory Usage\":739100676,\"Timestamp\":\"09-03-2024 07:34:54\",\"Context Requests\":0,\"Generation Requests\":3,\"MicroBatch ID\":0,\"Paused Requests\":0,\"Scheduled Requests\":3,\"Total Context Tokens\":0,\"Free KV cache blocks\":24,\"Max KV cache blocks\":40,\"Tokens per KV cache block\":64,\"Used KV cache blocks\":16}"
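One hedged observation that may tie these logs to the configuration above: the fill_template command sets max_tokens_in_paged_kv_cache:2560, and both verbose logs report 40 KV cache blocks of 64 tokens each, i.e. exactly 2560 KV cache tokens shared by all in-flight requests. A rough back-of-the-envelope check follows; the average sequence length per request is an assumption, not a value from this issue.

```python
# Back-of-the-envelope check of the KV cache budget implied by the logs above.
max_kv_blocks = 40        # "Max KV cache blocks" from the verbose logs
tokens_per_block = 64     # "Tokens per KV cache block" from the verbose logs
total_kv_tokens = max_kv_blocks * tokens_per_block
print(total_kv_tokens)    # 2560, matching max_tokens_in_paged_kv_cache:2560

# Assumed average sequence length (prompt + generated tokens) per request;
# purely illustrative.
avg_tokens_per_request = 512
print(total_kv_tokens // avg_tokens_per_request)  # ~5 requests fit in the KV
# cache at once, in the same ballpark as the logged "Scheduled Requests" values.
```

If that reading is right, it may be worth testing with a larger max_tokens_in_paged_kv_cache (or leaving it unset so kv_cache_free_gpu_mem_fraction governs the pool size), though I cannot confirm that is the root cause.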