request was blocked when gpt_model_type=inflight_fused_batching, serving baichuan model
Hello,
I am currently experiencing an issue with triton-inference-server/tensorrtllm_backend while trying to run a Baichuan model.
Description
I have set gpt_model_type=inflight_fused_batching in my model configuration, but when I send a request to the server on port 8000, the request stays in processing indefinitely, with no output whatsoever.
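For context, the parameter is set in the tensorrt_llm model's config.pbtxt roughly as follows; only the relevant parameter block is shown, and other entries (engine path, batching settings, etc.) are omitted:

```
# Relevant snippet from the tensorrt_llm model's config.pbtxt
# (other parameters such as the engine directory are omitted here)
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```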
Triton Information
I am using the latest commit from the main branch (e8ae70c583f8353a7dfebb1b424326a633b9360e). Here is my GPU device info:
To Reproduce
Steps to reproduce the behavior:
1. Set gpt_model_type=inflight_fused_batching in the model configuration.
2. Send a request to the Triton server on port 8000 (a sketch of the request is shown after this list).
3. Observe that the request stays in processing with no output.
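The request itself is nothing special. A minimal sketch of what I send, assuming the default tensorrtllm_backend model layout with an "ensemble" model and the standard generate endpoint (the model name and input field names may differ in other setups):

```
# Minimal HTTP request to the Triton generate endpoint on port 8000.
# "ensemble" and the input field names are assumptions based on the
# default tensorrtllm_backend model repository; adjust to your deployment.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```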
Here is some possibly related information captured with pstack:
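For reference, the stack dump was collected roughly like this (the pgrep-based pid lookup is just one way to find the server process):

```
# Dump the stack traces of the running tritonserver process.
pid=$(pgrep -f tritonserver | head -n1)
pstack "$pid" > tritonserver_pstack.txt
```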
Expected behavior
I would expect the server to process the request and return the generated output.
Thank you for your help.