TensorRT-LLM
[Question] TensorRT-LLM cannot do inference when sequence length > 200K?
Hi,
@byshiue @QiJune
I built a Llama model with a long sequence length (200K+ tokens), but I ran into the following problems:
- When doing inference with sequence length < 195K, it works well.
- When doing inference with sequence length > 200K, it raises a kernel error as follows.
Does TensorRT-LLM currently not support inference with sequence length > 200K? How can this problem be solved?
Any help will be appreciated.
This is a known issue in TRT 9 and is planned to be fixed in TRT 10. The cause is that the size of some intermediate buffers overflows the int32 data type.
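For context, a minimal sketch of the arithmetic: an int32 size or index tops out at 2,147,483,647, and an intermediate activation buffer whose element count is roughly seq_len times a per-token width can cross that limit in this sequence-length range. The buffer shape below (seq_len x 11008, the LLaMA-7B FFN intermediate width) is an assumption chosen for illustration, not necessarily the actual buffer TRT 9 overflows on.

```python
# Illustrative only: shows how an intermediate buffer's element count can
# exceed INT32_MAX between ~195K and ~200K tokens. The per-token width
# (11008, LLaMA-7B FFN intermediate size) is an assumed example shape.

INT32_MAX = 2**31 - 1       # 2,147,483,647
PER_TOKEN_WIDTH = 11008     # assumed width of the intermediate buffer

for seq_len in (195_000, 200_000):
    elements = seq_len * PER_TOKEN_WIDTH
    status = "overflows" if elements > INT32_MAX else "fits in"
    print(f"seq_len={seq_len:>7}: {elements:>13,} elements ({status} int32)")
```

Under this assumption, 195K tokens stays just under the int32 limit while 200K exceeds it, which matches the reported behavior; the real overflow in TRT 9 may involve a different buffer, but the mechanism is the same.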
Thanks for your reply. Looking forward to the TRT 10 release.