[feat]: Support weight-only GEMM with 2-bit weights
Support weight-only GEMM with 2-bit quantized weights (w2a16).
Note: This PR depends on two pull requests in the CUTLASS repo: https://github.com/NVIDIA/cutlass/pull/1512 https://github.com/NVIDIA/cutlass/pull/1517
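For readers unfamiliar with weight-only quantization, below is a minimal CPU sketch of what a w2a16 GEMM does conceptually: weights are stored as packed 2-bit codes with per-group scales and zero-points and are dequantized on the fly before the multiply-accumulate with 16-bit activations. The function name `w2a16_gemv`, the per-group scale/zero layout, and the 16-codes-per-`uint32_t` packing order are illustrative assumptions, not the interface or layout of the actual CUTLASS kernel in this PR.

```cpp
// Illustrative CPU sketch of a w2a16 weight-only GEMV (not the real GPU kernel).
#include <cstdint>
#include <cstdio>
#include <vector>

// Unpack the idx-th 2-bit code from a packed stream (16 codes per uint32_t).
static inline int unpack2bit(const std::vector<uint32_t>& packed, size_t idx) {
    uint32_t word = packed[idx / 16];
    return static_cast<int>((word >> (2 * (idx % 16))) & 0x3);  // code in [0, 3]
}

// y = dequant(W) * x, with W stored row-major as rows x cols 2-bit codes and
// one scale/zero-point per (row, group) of group_size columns.
void w2a16_gemv(const std::vector<uint32_t>& packedW,
                const std::vector<float>& scales,  // rows * (cols / group_size)
                const std::vector<float>& zeros,   // same shape as scales
                const std::vector<float>& x,       // cols (fp16 activations in the real kernel)
                std::vector<float>& y,             // rows
                size_t rows, size_t cols, size_t group_size) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (size_t c = 0; c < cols; ++c) {
            size_t g = r * (cols / group_size) + c / group_size;
            int q = unpack2bit(packedW, r * cols + c);
            float w = (static_cast<float>(q) - zeros[g]) * scales[g];  // dequantize
            acc += w * x[c];
        }
        y[r] = acc;
    }
}

int main() {
    // 1 row x 16 cols, one group: all codes are 3, scale 0.5, zero-point 1.
    std::vector<uint32_t> packedW = {0xFFFFFFFFu};  // sixteen 2-bit codes of 3
    std::vector<float> scales = {0.5f}, zeros = {1.0f};
    std::vector<float> x(16, 1.0f), y(1, 0.0f);
    w2a16_gemv(packedW, scales, zeros, x, y, 1, 16, 16);
    std::printf("y[0] = %f\n", y[0]);  // (3 - 1) * 0.5 * 16 = 16
    return 0;
}
```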
Hi @gavinchen430, nice work! Thanks for your contribution.
How can I reproduce this PR? Should I just check out the branch gavinchen430:gemm_w2a16, then build and run it locally? If you could provide guidance on running it with TensorRT-LLM and performance data for some models such as Llama 2, that would be greatly helpful. I think it would also assist the maintainers in reviewing this PR.
We are currently writing examples detailing how to produce quantized models using the quantization toolkit and how to deploy 2-bit quantized models using this w2a16 kernel. We will open-source these examples in this repository (https://github.com/bytedance/decoupleQ) soon.
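Until those examples land, here is a rough sketch of the kind of packed 2-bit weight format such a kernel consumes, assuming a naive per-group min/max asymmetric quantizer. This is only to illustrate the storage layout (packed codes plus per-group scale and zero-point); it is not decoupleQ's actual algorithm or the toolkit's API, and the packing order is the same illustrative assumption used in the sketch above.

```cpp
// Naive per-group asymmetric 2-bit quantizer (layout illustration only).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedWeights {
    std::vector<uint32_t> packed;  // 16 two-bit codes per uint32_t
    std::vector<float> scales;     // one per (row, group)
    std::vector<float> zeros;      // one per (row, group), in code space
};

QuantizedWeights quantize_w2(const std::vector<float>& W,  // row-major rows x cols
                             size_t rows, size_t cols, size_t group_size) {
    QuantizedWeights q;
    q.packed.assign((rows * cols + 15) / 16, 0u);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t g0 = 0; g0 < cols; g0 += group_size) {
            // Per-group min/max determine the scale and zero-point for codes 0..3.
            float lo = W[r * cols + g0], hi = lo;
            for (size_t c = g0; c < g0 + group_size; ++c) {
                lo = std::min(lo, W[r * cols + c]);
                hi = std::max(hi, W[r * cols + c]);
            }
            float scale = (hi - lo) / 3.0f;
            if (scale == 0.0f) scale = 1.0f;  // avoid division by zero
            float zero = -lo / scale;         // chosen so that (code - zero) * scale ~= W
            q.scales.push_back(scale);
            q.zeros.push_back(zero);
            for (size_t c = g0; c < g0 + group_size; ++c) {
                int code = static_cast<int>(std::lround((W[r * cols + c] - lo) / scale));
                code = std::max(0, std::min(3, code));
                size_t idx = r * cols + c;
                q.packed[idx / 16] |= static_cast<uint32_t>(code) << (2 * (idx % 16));
            }
        }
    }
    return q;
}
```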
Hi @gavinchen430, thank you for the contribution. Could you help provide an example in TensorRT-LLM, too? It would be helpful for understanding how to use this feature.
Fantastic, I'll try it later.
PR has not received an update in over 14 days. Adding stale label.