TensorRT-LLM
use selected index past past key value in attention when using contin…
When using a continuous KV cache, gpt_attention always uses the first past_key_value instead of past_key_value[selected_index]. This produces incorrect results whenever the continuous KV cache values are non-zero.
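A minimal sketch of the bug pattern described above (not the actual TensorRT-LLM implementation; the function names and toy cache values are hypothetical). With a continuous KV cache there is one cache entry per sequence slot, and attention must read the entry for the slot being decoded rather than always reading entry 0:

```python
def attend_buggy(past_key_values, selected_index):
    # Bug pattern: ignores selected_index and always reads the first cache.
    return past_key_values[0]


def attend_fixed(past_key_values, selected_index):
    # Fix: index into the per-slot caches with the selected index.
    return past_key_values[selected_index]


# Toy per-slot KV caches; the bug is invisible only when all slots hold zeros.
caches = [[0.0], [1.5], [2.5]]
print(attend_buggy(caches, 2))  # [0.0] -> wrong when slot 2 is non-zero
print(attend_fixed(caches, 2))  # [2.5] -> correct
```

This also matches the description's note that the error only shows up when the continuous KV cache values are non-zero: with all-zero caches, slot 0 and the selected slot happen to contain the same data.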
@Eayne
Hi, since TensorRT-LLM became GitHub-first as of last Monday, please rebase your MR on the latest main if you still want to contribute this.
Thanks June
Closing since no response after https://github.com/NVIDIA/TensorRT-LLM/pull/2682#issuecomment-2761036095. Feel free to reopen! @Eayne