Ming Wei
Thanks for reporting the issue. It has been fixed, and the fix will be included in a future update.
Actually, the issue should already have been fixed in last week's update (0514). @gloritygithub11, could you try with tensorrt-llm 0.10.0.dev2024051400 and let us know whether the issue is still...
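Once the new wheel is installed (the dev builds are typically pulled from NVIDIA's PyPI index), a quick way to confirm which build is actually in use is to check the reported version; a minimal sketch, assuming the package exposes `__version__` as recent releases do:

```python
import tensorrt_llm

# Confirm the dev wheel is the one picked up by the environment.
print(tensorrt_llm.__version__)  # expect "0.10.0.dev2024051400"
```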
@byshiue This seems to be a different error, not related to XQA. Could you help triage and reroute the issue? Thanks.
@aikitoria any update on this?
Thanks for raising the issue. We are aware of the garbage-output issue when kv cache reuse and sliding window attention are both enabled. We are on it right now.
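For anyone trying to reproduce or track this, the problematic combination is roughly the one sketched below. This is a minimal sketch, assuming a recent LLM API where `KvCacheConfig` exposes `enable_block_reuse` and a per-layer `max_attention_window`; field names may differ slightly across versions, and the model path is a placeholder:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# The combination that currently produces garbage output:
# kv cache block reuse enabled together with an attention window
# smaller than the context length (i.e. sliding window attention).
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,       # kv cache reuse
    max_attention_window=[4096],   # 4096-token sliding window, broadcast to all layers
)

llm = LLM(model="path/to/model", kv_cache_config=kv_cache_config)
outputs = llm.generate(["a prompt longer than the attention window ..."],
                       SamplingParams(max_tokens=64))
```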
Let me try to clarify a bit. We are working on a (somewhat complicated) solution to support the alternating sliding window attention + kv cache reuse scenario. By "alternating sliding window...
You are right about that. If we don't care about saving device memory or offloading blocks to host, a "BlockManager per window size" is not needed at all. We could simply keep...
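To make the trade-off concrete, here is a purely conceptual sketch (not the actual TensorRT-LLM implementation; all names are illustrative) of "a BlockManager per window size" versus a single shared pool:

```python
class BlockPool:
    """Illustrative free list of fixed-size KV cache blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

# Option A: one manager/pool per attention window size.
# Each window size can size its pool and offload blocks to host independently,
# at the cost of extra bookkeeping.
pools_per_window = {
    4096: BlockPool(num_blocks=1024),    # sliding-window layers
    32768: BlockPool(num_blocks=4096),   # full-attention layers
}

# Option B: if per-window-size memory savings and host offloading don't matter,
# a single shared pool serving all layers is sufficient.
shared_pool = BlockPool(num_blocks=5120)
```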
Thanks for raising the "sliding window in kv cache config" concern. We'll think about it.
We don't have plans to open-source these kernels for now. We will keep an eye on it and consider open-sourcing them once we find it appropriate.
Did you mean [multi query attention](https://arxiv.org/abs/1911.02150) or [group query attention](https://arxiv.org/pdf/2305.13245), where multiple q heads share each kv head? We have support for this use case already: https://github.com/NVIDIA/TensorRT-LLM/blob/794f61c99767fd2aa2d28709831c7a9e3501fd43/examples/llama/convert_checkpoint.py#L421 Just set...
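As a quick illustration of the grouped/multi-query setup (independent of the convert_checkpoint arguments), here is a small PyTorch sketch in which several query heads share each kv head; shapes and names are illustrative only:

```python
import torch

batch, seq = 2, 16
num_q_heads, num_kv_heads, head_dim = 8, 2, 64   # 4 query heads per kv head
group = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Expand kv heads so each group of query heads attends to its shared kv head.
k = k.repeat_interleave(group, dim=1)   # -> (batch, num_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v  # (batch, num_q_heads, seq, head_dim)
```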