How to change num_beams over multiple runs?
System Info
NVIDIA RTX A6000
Who can help?
@juney-nvidia
Hi
I'm interested in using TensorRT-LLM for multiple consecutive inference runs, but I'd like to be able to adjust the num_beams parameter between runs. However, when I attempt to do this, I get an error. Is there a way to make this work?
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'whisper_0_0', message: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Decoder is configured with beam width 3, but 1 was given (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/layers/dynamicDecodeLayer.cpp:211
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Run the whisper example from the `examples` folder and change num_beams dynamically over several consecutive inferences, as in the sketch below.
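Here is a minimal sketch of how I issue the two consecutive requests from the client side. The model name (`whisper`), the tensor names (`WAV`, `NUM_BEAMS`, `TRANSCRIPTS`), and their shapes/dtypes are placeholders for illustration only, not necessarily the actual I/O of the deployed whisper model; the point is only that the two `infer` calls request different beam widths.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder names below ("whisper", "WAV", "NUM_BEAMS", "TRANSCRIPTS") stand in
# for whatever the deployed whisper model actually exposes in its config.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def transcribe(audio: np.ndarray, num_beams: int) -> np.ndarray:
    # Audio features for this request
    wav = grpcclient.InferInput("WAV", list(audio.shape), "FP32")
    wav.set_data_from_numpy(audio.astype(np.float32))

    # Beam width requested for this particular inference
    beams = grpcclient.InferInput("NUM_BEAMS", [1, 1], "INT32")
    beams.set_data_from_numpy(np.array([[num_beams]], dtype=np.int32))

    result = client.infer(
        model_name="whisper",
        inputs=[wav, beams],
        outputs=[grpcclient.InferRequestedOutput("TRANSCRIPTS")],
    )
    return result.as_numpy("TRANSCRIPTS")

audio = np.random.rand(1, 80, 3000).astype(np.float32)  # dummy mel features
transcribe(audio, num_beams=3)  # first request: succeeds
transcribe(audio, num_beams=1)  # second request: raises the beam-width assertion
```

The first call (num_beams=3) succeeds; the second call (num_beams=1) fails with the assertion shown under actual behavior.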
Expected behavior
I should be able to change num_beams between inference requests without an error.
actual behavior
I'm encountering an error.
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'whisper_0_0', message: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Decoder is configured with beam width 3, but 1 was given (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/layers/dynamicDecodeLayer.cpp:211
additional notes
In the error message above, I first ran an inference with num_beams=3 and then a second one with num_beams=1.
@OValery16 Would you mind sharing how you do it? It looks like you are using tritonclient?
Hi @OValery16, do you still have any further issues or questions? If not, we'll close this soon.