[feat]: Support weight-only GEMM with 2-bit weights
Support weight-only GEMM with 2-bit quantized weights (w2a16).
Note: This PR depends on two pull requests in the CUTLASS repo: https://github.com/NVIDIA/cutlass/pull/1512 https://github.com/NVIDIA/cutlass/pull/1517
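For readers unfamiliar with weight-only quantization, below is a minimal CPU sketch of what a w2a16 GEMM does conceptually: weights are stored as packed 2-bit codes with per-group scales and zero-points and are dequantized on the fly before the multiply-accumulate with 16-bit activations. The function name `w2a16_gemv`, the per-group scale/zero layout, and the 16-codes-per-`uint32_t` packing order are illustrative assumptions, not the interface or layout of the actual CUTLASS kernel in this PR.

```cpp
// Illustrative CPU sketch of a w2a16 weight-only GEMV (not the real GPU kernel).
#include <cstdint>
#include <cstdio>
#include <vector>

// Unpack the idx-th 2-bit code from a packed stream (16 codes per uint32_t).
static inline int unpack2bit(const std::vector<uint32_t>& packed, size_t idx) {
    uint32_t word = packed[idx / 16];
    return static_cast<int>((word >> (2 * (idx % 16))) & 0x3);  // code in [0, 3]
}

// y = dequant(W) * x, with W stored row-major as rows x cols 2-bit codes and
// one scale/zero-point per (row, group) of group_size columns.
void w2a16_gemv(const std::vector<uint32_t>& packedW,
                const std::vector<float>& scales,  // rows * (cols / group_size)
                const std::vector<float>& zeros,   // same shape as scales
                const std::vector<float>& x,       // cols (fp16 activations in the real kernel)
                std::vector<float>& y,             // rows
                size_t rows, size_t cols, size_t group_size) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (size_t c = 0; c < cols; ++c) {
            size_t g = r * (cols / group_size) + c / group_size;
            int q = unpack2bit(packedW, r * cols + c);
            float w = (static_cast<float>(q) - zeros[g]) * scales[g];  // dequantize
            acc += w * x[c];
        }
        y[r] = acc;
    }
}

int main() {
    // 1 row x 16 cols, one group: all codes are 3, scale 0.5, zero-point 1.
    std::vector<uint32_t> packedW = {0xFFFFFFFFu};  // sixteen 2-bit codes of 3
    std::vector<float> scales = {0.5f}, zeros = {1.0f};
    std::vector<float> x(16, 1.0f), y(1, 0.0f);
    w2a16_gemv(packedW, scales, zeros, x, y, 1, 16, 16);
    std::printf("y[0] = %f\n", y[0]);  // (3 - 1) * 0.5 * 16 = 16
    return 0;
}
```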
Hi @gavinchen430, nice work! Thanks for your contribution.
How can I reproduce this PR? Should I just check out the branch gavinchen430:gemm_w2a16, then build and run it locally? If you could provide guidance on running it with TensorRT-LLM and performance data for some models such as Llama 2, that would be greatly helpful. I think it would also assist the maintainers in reviewing this PR.
We are currently writing examples detailing how to produce quantized models using the quantization toolkit and how to deploy 2-bit quantized models using this w2a16 kernel. We will open-source these examples in this repository (https://github.com/bytedance/decoupleQ) soon.
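Until those examples land, here is a rough sketch of the kind of packed 2-bit weight format such a kernel consumes, assuming a naive per-group min/max asymmetric quantizer. This is only to illustrate the storage layout (packed codes plus per-group scale and zero-point); it is not decoupleQ's actual algorithm or the toolkit's API, and the packing order is the same illustrative assumption used in the sketch above.

```cpp
// Naive per-group asymmetric 2-bit quantizer (layout illustration only).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedWeights {
    std::vector<uint32_t> packed;  // 16 two-bit codes per uint32_t
    std::vector<float> scales;     // one per (row, group)
    std::vector<float> zeros;      // one per (row, group), in code space
};

QuantizedWeights quantize_w2(const std::vector<float>& W,  // row-major rows x cols
                             size_t rows, size_t cols, size_t group_size) {
    QuantizedWeights q;
    q.packed.assign((rows * cols + 15) / 16, 0u);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t g0 = 0; g0 < cols; g0 += group_size) {
            // Per-group min/max determine the scale and zero-point for codes 0..3.
            float lo = W[r * cols + g0], hi = lo;
            for (size_t c = g0; c < g0 + group_size; ++c) {
                lo = std::min(lo, W[r * cols + c]);
                hi = std::max(hi, W[r * cols + c]);
            }
            float scale = (hi - lo) / 3.0f;
            if (scale == 0.0f) scale = 1.0f;  // avoid division by zero
            float zero = -lo / scale;         // chosen so that (code - zero) * scale ~= W
            q.scales.push_back(scale);
            q.zeros.push_back(zero);
            for (size_t c = g0; c < g0 + group_size; ++c) {
                int code = static_cast<int>(std::lround((W[r * cols + c] - lo) / scale));
                code = std::max(0, std::min(3, code));
                size_t idx = r * cols + c;
                q.packed[idx / 16] |= static_cast<uint32_t>(code) << (2 * (idx % 16));
            }
        }
    }
    return q;
}
```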
Hi @gavinchen430, thank you for the contribution. Could you help provide an example in TensorRT-LLM, too? It would be helpful for understanding how to use this feature.
Fantastic, I'll try it later.
PR has not received an update in over 14 days. Adding stale label.