
[fix] avoid the overflow issue when supporting 32k sequence length

Open llsj14 opened this issue 1 year ago • 0 comments

I found that the unfused attention kernels (softmax, transpose, etc.) can support a sequence length of 32k and are largely resilient to overflow issues.

However, the addRelativeAttentionBiasUnaligned kernel uses an `int` data type for indexing, which can overflow when both the sequence length and the maximum sequence length are set to 32k.

To resolve this, I changed the index type to int64_t and added static casts.

Please review this if possible.

llsj14 commented Feb 11 '24 12:02