
[fix] avoid the overflow issue when supporting 32k sequence length

Open llsj14 opened this issue 1 year ago • 0 comments

I found that the unfused attention kernels (softmax, transpose, etc.) can support a sequence length of 32k and are largely resilient to overflow issues.

However, the addRelativeAttentionBiasUnaligned kernel uses an `int` data type for indexing, which can overflow when both the sequence length and the maximum sequence length are set to 32k.

To resolve this, I changed the index type to int64_t and added static casts.

Please review this if possible.

llsj14 commented Feb 11 '24 12:02