TensorRT-LLM feat: Cohere2ForCausalLM support (Command-A, Command-R7B)

This adds support for the Cohere2ForCausalLM architecture, which interleaves global attention layers that use no position embedding with sliding window layers that use RoPE positions. I also fixed RuntimeDefaults not actually working in the Python API (this is the first model that has ever used it).

For previous discussion, see https://github.com/NVIDIA/TensorRT-LLM/issues/2912

(This PR is not 100% finished: it should load the correct sliding window config from the model config rather than hard-coding it. I will add that soon.)
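
For illustration, here is a minimal sketch of deriving the per-layer attention layout from a config instead of hard-coding it. The field names `sliding_window` and `sliding_window_pattern`, the example values, and the rule that every `sliding_window_pattern`-th layer is global are assumptions modeled on the Hugging Face Cohere2 config; `LayerPlan` and `layer_plan` are hypothetical names and not part of this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerPlan:
    index: int
    use_rope: bool          # sliding window layers use RoPE positions
    window: Optional[int]   # None means global (full) attention


def layer_plan(config: dict) -> List[LayerPlan]:
    """Hypothetical helper: build the per-layer layout from a model config dict."""
    num_layers = config["num_hidden_layers"]
    window = config["sliding_window"]
    pattern = config["sliding_window_pattern"]
    plans = []
    for i in range(num_layers):
        # Assumption: every `pattern`-th layer is global, the rest are sliding window.
        is_global = (i + 1) % pattern == 0
        plans.append(
            LayerPlan(
                index=i,
                use_rope=not is_global,
                window=None if is_global else window,
            )
        )
    return plans


# Example values are assumptions, not read from a real checkpoint.
example_config = {
    "num_hidden_layers": 8,
    "sliding_window": 4096,
    "sliding_window_pattern": 4,
}
for plan in layer_plan(example_config):
    if plan.window is None:
        kind = "global, no position embedding"
    else:
        kind = f"sliding window {plan.window}, RoPE"
    print(plan.index, kind)
```

Reading the window size and pattern from the config keeps a single code path for both Command-A and Command-R7B, whatever values their checkpoints actually specify.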

Mar 27 '25 15:03 aikitoria

Thanks @aikitoria for contributing this.

We will do a first pass on this MR to provide early feedback as well.

Thanks, June

Mar 28 '25 00:03 juney-nvidia

Any update on this? I'd like to use Command-A in the near future.

May 01 '25 15:05 iibw

I've still been much too busy, so I haven't gotten around to cleaning up the PR. You can theoretically use the model on this branch, https://github.com/aikitoria/TensorRT-LLM/tree/experiments2, which I recently rebased onto latest. I hope I can soon get to cleaning this up properly so it can be submitted.

We also still don't have a way to disable the cyclic KV cache, which means block reuse does not work with this model. This is very annoying for interactive chat, and we can't fix it ourselves because it requires edits to the compiled kernels.
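
A toy illustration of the interaction described above, assuming the cyclic cache behaves like a ring buffer that overwrites its oldest entries once the window is full. `CyclicKVCache` and `holds_prefix` are hypothetical names, and token ids stand in for real key/value tensors; this is not TensorRT-LLM's cache manager or kernel code.

```python
class CyclicKVCache:
    """Toy ring buffer that keeps only the last `window` entries."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.count = 0  # total number of tokens appended so far

    def append(self, token_id: int) -> None:
        # Once the window is full, the oldest slot is overwritten in place.
        self.slots[self.count % self.window] = token_id
        self.count += 1

    def holds_prefix(self, prefix: list) -> bool:
        """True only if every prefix token is still stored in its original slot."""
        if len(prefix) > self.window:
            return False
        return self.slots[: len(prefix)] == list(prefix)


cache = CyclicKVCache(window=4)
for token_id in [10, 11, 12, 13]:
    cache.append(token_id)
reusable_before = cache.holds_prefix([10, 11])

# Two more tokens wrap around and overwrite the slots that held the prefix.
cache.append(14)
cache.append(15)
reusable_after = cache.holds_prefix([10, 11])

print(reusable_before, reusable_after)
```

In this toy model the shared prefix of a chat can no longer be served from the cache after the window wraps, which is the situation block reuse depends on avoiding.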

May 01 '25 15:05 aikitoria

Closing, as there has been no activity from the requester for 10+ days. Feel free to reopen when you have bandwidth to work on this!

Jun 05 '25 19:06 poweiw