How to change num_beams over multiple runs?
System Info
NVIDIA RTX A6000
Who can help?
@juney-nvidia
Hi
I'm interested in using TensorRT-LLM for multiple consecutive inference runs, but I'd like to be able to adjust the num_beams parameter between runs. However, when I attempt to do this, I get an error. Is there a way to make this work?
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'whisper_0_0', message: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Decoder is configured with beam width 3, but 1 was given (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/layers/dynamicDecodeLayer.cpp:211
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Run the whisper example from the `examples` folder and change num_beams dynamically over several consecutive inferences, as in the sketch below.
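Here is a minimal sketch of how I issue the two consecutive requests from the client side. The model name (`whisper`), the tensor names (`WAV`, `NUM_BEAMS`, `TRANSCRIPTS`), and their shapes/dtypes are placeholders for illustration only, not necessarily the actual I/O of the deployed whisper model; the point is only that the two `infer` calls request different beam widths.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder names below ("whisper", "WAV", "NUM_BEAMS", "TRANSCRIPTS") stand in
# for whatever the deployed whisper model actually exposes in its config.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def transcribe(audio: np.ndarray, num_beams: int) -> np.ndarray:
    # Audio features for this request
    wav = grpcclient.InferInput("WAV", list(audio.shape), "FP32")
    wav.set_data_from_numpy(audio.astype(np.float32))

    # Beam width requested for this particular inference
    beams = grpcclient.InferInput("NUM_BEAMS", [1, 1], "INT32")
    beams.set_data_from_numpy(np.array([[num_beams]], dtype=np.int32))

    result = client.infer(
        model_name="whisper",
        inputs=[wav, beams],
        outputs=[grpcclient.InferRequestedOutput("TRANSCRIPTS")],
    )
    return result.as_numpy("TRANSCRIPTS")

audio = np.random.rand(1, 80, 3000).astype(np.float32)  # dummy mel features
transcribe(audio, num_beams=3)  # first request: succeeds
transcribe(audio, num_beams=1)  # second request: raises the beam-width assertion
```

The first call (num_beams=3) succeeds; the second call (num_beams=1) fails with the assertion shown under actual behavior.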
Expected behavior
I should be able to change num_beams between inference requests without an error.
actual behavior
I'm encountering an error.
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'whisper_0_0', message: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Decoder is configured with beam width 3, but 1 was given (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/layers/dynamicDecodeLayer.cpp:211
additional notes
In the error message above, I first ran an inference with num_beams=3 and then a second one with num_beams=1.
@OValery16 Would you mind sharing how you do it? It looks like you are using tritonclient?
Hi @OValery16, do you still have any further issues or questions? If not, we'll close this soon.