Ming Wei
Thanks for reporting the issue. It has been fixed, and the fix will be included in a future update.
Actually, the issue should already have been fixed in last week's update (0514). @gloritygithub11, could you try with tensorrt-llm 0.10.0.dev2024051400 and let us know whether the issue is still...
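Once the new wheel is installed (the dev builds are typically pulled from NVIDIA's PyPI index), a quick way to confirm which build is actually in use is to check the reported version; a minimal sketch, assuming the package exposes `__version__` as recent releases do:

```python
import tensorrt_llm

# Confirm the dev wheel is the one picked up by the environment.
print(tensorrt_llm.__version__)  # expect "0.10.0.dev2024051400"
```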
@byshiue This seems to be a different error, not related to XQA. Could you help triage and reroute the issue? Thanks.
@aikitoria any update on this?
Thanks for raising the issue. We are aware of the garbage-output issue when kv cache reuse and sliding window attention are both enabled. We are on it right now.
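For anyone trying to reproduce or track this, the problematic combination is roughly the one sketched below. This is a minimal sketch, assuming a recent LLM API where `KvCacheConfig` exposes `enable_block_reuse` and a per-layer `max_attention_window`; field names may differ slightly across versions, and the model path is a placeholder:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# The combination that currently produces garbage output:
# kv cache block reuse enabled together with an attention window
# smaller than the context length (i.e. sliding window attention).
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,       # kv cache reuse
    max_attention_window=[4096],   # 4096-token sliding window, broadcast to all layers
)

llm = LLM(model="path/to/model", kv_cache_config=kv_cache_config)
outputs = llm.generate(["a prompt longer than the attention window ..."],
                       SamplingParams(max_tokens=64))
```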
Let me try to clarify a bit. We are working on a (somewhat complicated) solution to support the alternating sliding window attention + kv cache reuse scenario. By "alternating sliding window...
You are right about that. If we don't care about saving device memory or offloading blocks to host, a "BlockManager per window size" is not needed at all. We could simply keep...
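To make the trade-off concrete, here is a purely conceptual sketch (not the actual TensorRT-LLM implementation; all names are illustrative) of "a BlockManager per window size" versus a single shared pool:

```python
class BlockPool:
    """Illustrative free list of fixed-size KV cache blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

# Option A: one manager/pool per attention window size.
# Each window size can size its pool and offload blocks to host independently,
# at the cost of extra bookkeeping.
pools_per_window = {
    4096: BlockPool(num_blocks=1024),    # sliding-window layers
    32768: BlockPool(num_blocks=4096),   # full-attention layers
}

# Option B: if per-window-size memory savings and host offloading don't matter,
# a single shared pool serving all layers is sufficient.
shared_pool = BlockPool(num_blocks=5120)
```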
Thanks for raising the "sliding window in kv cache config" concern. We'll think about it.
We don't have plans to open-source these kernels for now. We will keep an eye on it and consider open-sourcing them once we find it appropriate.
Did you mean [multi query attention](https://arxiv.org/abs/1911.02150) or [group query attention](https://arxiv.org/pdf/2305.13245), where multiple q heads share each kv head? We have support for this use case already: https://github.com/NVIDIA/TensorRT-LLM/blob/794f61c99767fd2aa2d28709831c7a9e3501fd43/examples/llama/convert_checkpoint.py#L421 Just set...
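As a quick illustration of the grouped/multi-query setup (independent of the convert_checkpoint arguments), here is a small PyTorch sketch in which several query heads share each kv head; shapes and names are illustrative only:

```python
import torch

batch, seq = 2, 16
num_q_heads, num_kv_heads, head_dim = 8, 2, 64   # 4 query heads per kv head
group = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Expand kv heads so each group of query heads attends to its shared kv head.
k = k.repeat_interleave(group, dim=1)   # -> (batch, num_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v  # (batch, num_q_heads, seq, head_dim)
```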