tensorrtllm_backend
On the main branch, stress testing the Triton Server with in-flight batching from multiple threads can cause the server to get stuck.
As the title indicates, on the main branch I used 40 threads to send inference requests simultaneously to the Triton Server running with in-flight batching, and the server got stuck.
The specific behavior is as follows: GPU utilization in nvidia-smi stays at 100%, power consumption ranges from 80 W to 95 W, and during this time none of the threads receive responses. The situation is illustrated in the image below:
In my testing, the issue does not occur when requests are sent from a single thread. With a larger number of threads, however, the problem appears after approximately one minute. If I switch back to the release/0.5.0 branch, the Triton Server remains healthy even under continuous stress testing with 40 threads for 16 hours.
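For reference, below is a minimal sketch of this kind of multi-threaded stress-test client (a reconstruction, not my exact script; the endpoint, the `ensemble` model name, and the `text_input`/`max_tokens`/`text_output` tensor names are assumptions based on the tensorrtllm_backend example configuration):

```python
# Minimal multi-threaded stress-test sketch (assumed endpoint, model, and
# tensor names -- adjust to match your deployment).
import threading
import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"    # assumed Triton HTTP endpoint
MODEL_NAME = "ensemble"   # assumed ensemble model from the examples
NUM_THREADS = 40          # thread count used in this report


def worker() -> None:
    """Send inference requests in a tight loop until interrupted."""
    client = httpclient.InferenceServerClient(url=URL)
    while True:
        text = np.array([["What is the capital of France?"]], dtype=object)
        max_tokens = np.array([[128]], dtype=np.int32)
        inputs = [
            httpclient.InferInput("text_input", [1, 1], "BYTES"),
            httpclient.InferInput("max_tokens", [1, 1], "INT32"),
        ]
        inputs[0].set_data_from_numpy(text)
        inputs[1].set_data_from_numpy(max_tokens)
        # When the server hangs, this call never returns for any thread.
        result = client.infer(MODEL_NAME, inputs)
        _ = result.as_numpy("text_output")


threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```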
I am happy to provide more information if needed.
This is how I build the engine:
```bash
python3 build.py --model_dir=/XXXXXX/ckpt/LLaMA-7B \
    --dtype bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --use_gemm_plugin bfloat16 \
    --output_dir /XXXXXX/LLaMA-7B/bf16/1-gpu/8k-1k \
    --world_size 8 \
    --tp_size 8 \
    --max_input_len 8192 \
    --max_output_len 1024 \
    --max_batch_size 64 \
    --remove_input_padding \
    --enable_context_fmha \
    --parallel_build \
    --paged_kv_cache \
    --use_inflight_batching
```