
On the main branch, stress testing the in-flight batching Triton Server with multiple threads can result in the server getting stuck.

StarrickLiu opened this issue · 13 comments

As indicated by the title, on the main branch I used 40 threads to send inference requests simultaneously to the in-flight batching Triton Server, and the server got stuck.

The specific behavior is as follows: GPU utilization in nvidia-smi stays at 100%, power consumption ranges from 80 W to 95 W, and during this time none of the threads receive responses. The situation is illustrated in the screenshot below:

[Screenshot: nvidia-smi output showing all GPUs pinned at 100% utilization]

After testing, there is no such issue when sending requests with a single thread. With a larger number of threads, however, the problem occurs after approximately one minute. If I switch back to the release/0.5.0 branch, the Triton Server remains healthy even under continuous stress testing with 40 threads for 16 hours.
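For reference, the stress client is a simple multi-threaded request loop along the lines of the sketch below (simplified; the model name "ensemble", the tensor names "text_input" / "max_tokens" / "text_output", and the dtypes are assumptions based on the default tensorrtllm_backend model repository and may need adjusting to match your config.pbtxt):

import threading

import numpy as np
import tritonclient.http as httpclient

# Rough sketch of the stress client, for illustration only.
URL = "localhost:8000"
NUM_THREADS = 40
PROMPT = "Hello, my name is"

def worker():
    client = httpclient.InferenceServerClient(url=URL)
    text = np.array([[PROMPT]], dtype=object)
    max_tokens = np.array([[1024]], dtype=np.uint32)  # dtype depends on the ensemble config

    text_in = httpclient.InferInput("text_input", list(text.shape), "BYTES")
    text_in.set_data_from_numpy(text)
    tokens_in = httpclient.InferInput("max_tokens", list(max_tokens.shape), "UINT32")
    tokens_in.set_data_from_numpy(tokens_in_data := max_tokens)

    # Each thread sends requests back to back and blocks until a response
    # arrives, so a hung server shows up as every thread stuck in infer().
    while True:
        result = client.infer("ensemble", inputs=[text_in, tokens_in])
        _ = result.as_numpy("text_output")

if __name__ == "__main__":
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()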

I am happy to provide more information if needed.

This is how I build the engine:

python3 build.py --model_dir=/XXXXXX/ckpt/LLaMA-7B \
                 --dtype bfloat16 \
                 --use_gpt_attention_plugin bfloat16 \
                 --use_gemm_plugin bfloat16 \
                 --output_dir /XXXXXX/LLaMA-7B/bf16/1-gpu/8k-1k \
                 --world_size 8 \
                 --tp_size 8 \
                 --max_input_len 8192 \
                 --max_output_len 1024 \
                 --max_batch_size 64 \
                 --remove_input_padding \
                 --enable_context_fmha \
                 --parallel_build \
                 --paged_kv_cache \
                 --use_inflight_batching
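The engine is then deployed through the usual launch script for a TP=8 setup, along these lines (the model repository path is a placeholder and the exact arguments may differ slightly between versions):

python3 scripts/launch_triton_server.py --world_size 8 \
        --model_repo /XXXXXX/triton_model_repo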

StarrickLiu · Nov 21 '23