Building T5 with the `--debug_mode` flag causes runtime execution to fail
System Info
- CPU: x86_64
- OS: Linux Ubuntu
- GPU: NVIDIA A100 and A10G (through Latitude.sh and AWS EC2, respectively)
- TensorRT-LLM version: 0.9.0
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
When I convert and build a T5 model (I tried `t5-small` and `t5-base` from Hugging Face) without `--remove_input_padding` and with the `--debug_mode` build flag enabled (all other parameters matching the single-GPU example command in the example README), `encoder_run` fails on an NVIDIA A100 or A10G GPU with the following error:
```
[04/12/2024-21:13:17] [TRT] [E] 3: [executionContext.cpp::enqueueV3::2650] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueueV3::2650, condition: mContext.profileObliviousBindings.at(profileObliviousIndex) || getPtrOrNull(mOutputAllocators, profileObliviousIndex) )
```
The error does not occur when running on an NVIDIA H100 (sm90), so it seems localized to earlier architectures: both the A100 (sm80) and the A10G (sm86) fail. Other architectures may be affected as well, but these three are the only ones I have tried.
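For context, my rough mental model (an assumption on my part; I have not verified this against the TensorRT-LLM source) is that `--debug_mode` marks intermediate tensors as additional engine outputs so they can be dumped at runtime, along the lines of this plain-TensorRT sketch:

```python
import tensorrt as trt

# Illustrative only: the function name and loop are mine, not TensorRT-LLM's
# actual --debug_mode implementation.
def mark_layer_outputs_for_debug(network: trt.INetworkDefinition) -> None:
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            t = layer.get_output(j)
            # Every tensor marked here becomes one more engine binding that
            # the runtime must supply memory for before enqueueV3.
            if t is not None and not t.is_network_output:
                network.mark_output(t)
```

If that is right, a debug build has many more output bindings than a normal build, which is consistent with the failure only appearing when `--debug_mode` is set.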
Expected behavior
A successful run of the TensorRT-LLM version of a T5 model, with layer-by-layer outputs printed for debugging purposes.
Actual behavior
A "Runtime execution failed" assert failure in `encoder_run`, preceded by `Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueueV3::2650, condition: mContext.profileObliviousBindings.at(profileObliviousIndex) || getPtrOrNull(mOutputAllocators, profileObliviousIndex) )`, raised during the execution of `self.encoder_session.run(inputs, outputs, self.stream.cuda_stream)`.
Additional notes
I tried googling the error, but the results were generally unhelpful. This could be an issue stemming from TensorRT itself rather than from the TensorRT-LLM wrapper.
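The assertion text itself hints at the mechanism: `enqueueV3` requires every output tensor to have either a device address or an `IOutputAllocator` bound before launch, and the check `mContext.profileObliviousBindings.at(...) || getPtrOrNull(mOutputAllocators, ...)` fails if one of the (debug) outputs is left unbound. Here is a minimal sketch of that invariant in the plain TensorRT Python API (illustrative code; `run_v3` is a made-up helper, not TensorRT-LLM's `Session.run`):

```python
import numpy as np
import tensorrt as trt
import torch

def run_v3(engine: trt.ICudaEngine,
           context: trt.IExecutionContext,
           inputs: dict,                      # name -> device torch.Tensor
           stream: torch.cuda.Stream) -> dict:
    names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]

    # Pass 1: bind inputs (input shapes must be set before output shapes
    # can be queried).
    for name in names:
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            context.set_input_shape(name, tuple(inputs[name].shape))
            context.set_tensor_address(name, inputs[name].data_ptr())

    # Pass 2: bind outputs. Every output, including any extra debug outputs
    # a --debug_mode build adds, needs an address (or an IOutputAllocator);
    # skipping one produces exactly the Error Code 3 above.
    outputs = {}
    for name in names:
        if engine.get_tensor_mode(name) == trt.TensorIOMode.OUTPUT:
            shape = tuple(context.get_tensor_shape(name))
            np_dtype = trt.nptype(engine.get_tensor_dtype(name))
            torch_dtype = torch.from_numpy(np.empty(0, np_dtype)).dtype
            buf = torch.empty(shape, dtype=torch_dtype, device="cuda")
            context.set_tensor_address(name, buf.data_ptr())
            outputs[name] = buf

    # execute_async_v3 is the Python binding for enqueueV3.
    assert context.execute_async_v3(stream.cuda_stream)
    return outputs
```

So one plausible reading is that the sm80/sm86 engines end up with an output binding the runtime never binds, while the sm90 engine does not; but that is speculation on my part.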
@symphonylyh any updates?
@varyn-woo, apologies for the very delayed response. Is this ticket still relevant? If so, could you try the latest version to see whether the issue persists?
Issue has not received an update in over 14 days. Adding stale label.
Closing this issue as stale. If the problem persists in the latest release, please feel free to open a new one. Thank you!