TensorRT-LLM
[Question] TensorRT-LLM cannot do inference when sequence length > 200K?
Hi,
@byshiue @QiJune
I built a Llama model with a long sequence length (200K+ tokens), but I ran into the following problems:
- When doing inference with sequence length < 195K, it works well.
- When doing inference with sequence length > 200K, it raises a kernel error as follows.
Does TensorRT-LLM currently not support inference with sequence length > 200K? How can this problem be solved?
Any help will be appreciated.
This is a known issue in TRT 9 and is planned to be fixed in TRT 10. The cause is that the size of some intermediate buffers overflows the int32 data type.
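For context, a minimal sketch of the arithmetic: an int32 size or index tops out at 2,147,483,647, and an intermediate activation buffer whose element count is roughly seq_len times a per-token width can cross that limit in this sequence-length range. The buffer shape below (seq_len x 11008, the LLaMA-7B FFN intermediate width) is an assumption chosen for illustration, not necessarily the actual buffer TRT 9 overflows on.

```python
# Illustrative only: shows how an intermediate buffer's element count can
# exceed INT32_MAX between ~195K and ~200K tokens. The per-token width
# (11008, LLaMA-7B FFN intermediate size) is an assumed example shape.

INT32_MAX = 2**31 - 1       # 2,147,483,647
PER_TOKEN_WIDTH = 11008     # assumed width of the intermediate buffer

for seq_len in (195_000, 200_000):
    elements = seq_len * PER_TOKEN_WIDTH
    status = "overflows" if elements > INT32_MAX else "fits in"
    print(f"seq_len={seq_len:>7}: {elements:>13,} elements ({status} int32)")
```

Under this assumption, 195K tokens stays just under the int32 limit while 200K exceeds it, which matches the reported behavior; the real overflow in TRT 9 may involve a different buffer, but the mechanism is the same.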
Thanks for your reply. Looking forward to the TRT 10 release.