request was blocked when gpt_model_type=inflight_fused_batching, serving baichuan model
Hello,
I am currently experiencing an issue with triton-inference-server/tensorrtllm_backend while trying to run a Baichuan model.
Description
I have set gpt_model_type=inflight_fused_batching in my model configuration, but when I send a request to the server on port 8000, the request stays in processing indefinitely, with no output whatsoever.
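For context, the parameter is set in the tensorrt_llm model's config.pbtxt roughly as follows; only the relevant parameter block is shown, and other entries (engine path, batching settings, etc.) are omitted:

```
# Relevant snippet from the tensorrt_llm model's config.pbtxt
# (other parameters such as the engine directory are omitted here)
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```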
Triton Information
I am using the latest commit from the main branch (e8ae70c583f8353a7dfebb1b424326a633b9360e). Here is my GPU device info:
To Reproduce
Steps to reproduce the behavior:
1. Set gpt_model_type=inflight_fused_batching in the model configuration.
2. Send a request to the Triton server on port 8000 (a sketch of the request is shown after this list).
3. Observe that the request stays in processing with no output.
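The request itself is nothing special. A minimal sketch of what I send, assuming the default tensorrtllm_backend model layout with an "ensemble" model and the standard generate endpoint (the model name and input field names may differ in other setups):

```
# Minimal HTTP request to the Triton generate endpoint on port 8000.
# "ensemble" and the input field names are assumptions based on the
# default tensorrtllm_backend model repository; adjust to your deployment.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```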
Here is some possibly related information captured with pstack:
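For reference, the stack dump was collected roughly like this (the pgrep-based pid lookup is just one way to find the server process):

```
# Dump the stack traces of the running tritonserver process.
pid=$(pgrep -f tritonserver | head -n1)
pstack "$pid" > tritonserver_pstack.txt
```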
Expected behavior
I would expect the server to process the request and return the generated output.
Thank you for your help.