Add CUDA kernels for Wint4Afloat16
There are many kernels available to efficiently perform matrix multiplication using packed int4 weights and float16 inputs.
The goal of this issue is to select some of them and add them to a quanto cuda extension (that will be alongside the cpp and mps extensions under library/ext).
A promising candidate is the pair of kernels recently implemented in AWQ: https://github.com/mit-han-lab/llm-awq/pull/142.
These kernels are directly inspired by TensorRT-LLM and seem very fast, while not requiring any heavy dependencies.
The main difficulty would be to support a new packing scheme, hence a new flavor of PackedTensor: for efficiency, the packed weights are interleaved.
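For reference, here is a minimal sketch of plain int4 packing (two values per byte) and its inverse; this is not the AWQ layout itself, which adds an interleaved ordering on top of it, but it is the baseline scheme the new flavor would extend:

```python
import torch

def pack_int4(values: torch.Tensor) -> torch.Tensor:
    """Pack pairs of unsigned int4 values (0..15) into uint8, two per byte.

    This is the straightforward layout; the AWQ kernels expect an extra
    interleaving on top of this, which is the new packing flavor
    discussed in this issue.
    """
    assert values.shape[-1] % 2 == 0
    values = values.to(torch.uint8)
    low = values[..., 0::2]
    high = values[..., 1::2]
    return low | (high << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4, recovering the original unsigned int4 values."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)

weights = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
assert torch.equal(unpack_int4(pack_int4(weights)), weights)
```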
cc @younesbelkada @SunMarc
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I have made some progress on this (branch):
- I verified that I could replicate AWQ packing format,
- I added a method to unpack AWQ packed tensors (required for unit tests and fallbacks),
- I verified that AWQ quantization is equivalent to quanto affine group-wise quantization,
- I built a tiny extension compiled and loaded on-demand with just the gemm/gemv fast kernels (see the sketch after this list),
- I verified I could call gemm and get similar results to a python fallback.
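For context, the on-demand compilation can rely on torch's JIT extension loader; the source file names below are placeholders, not the actual files in the branch:

```python
from torch.utils.cpp_extension import load

# Compile and load the CUDA sources the first time they are needed.
# "awq_ext.cpp" / "awq_gemm.cu" are placeholder names, not the actual
# sources used in the quanto branch.
awq_ext = load(
    name="quanto_awq_ext",
    sources=["awq_ext.cpp", "awq_gemm.cu"],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

# The loaded module exposes whatever functions the C++ bindings declare,
# e.g. awq_ext.gemm_forward(...) for the gemm kernel.
```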
I still need to:
- integrate all this into an end-to-end equivalence comparison with the current quanto dequantized fp16/int4 mm,
- introduce the new packing format and kernels into quanto.
For the second point I am not decided yet whether the packing should happen only when moving the model to a CUDA device or always as a default (but this complexifies the unpacking).
I have successfully integrated the kernels with quanto tensors:
- quantize weights with quanto,
- repack the data,
- pass repacked data, scale and zeropoint to the kernels,
- verify the outputs match torch.matmul with quanto weights (compared as sketched below).
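A rough sketch of that comparison, assuming unpacked int4 data with per-group float16 scales and zeropoints; the fast-kernel call is left as a comment because the exact binding name is specific to the branch:

```python
import torch

def dequantize_reference(qweight, scales, zeropoints, group_size=128):
    """Group-wise affine dequantization: (data - zeropoint) * scale.

    qweight:    (out_features, in_features) unpacked int4 values
    scales:     (out_features, in_features // group_size) float16
    zeropoints: (out_features, in_features // group_size)
    """
    q = qweight.to(torch.float16)
    s = scales.repeat_interleave(group_size, dim=1)
    z = zeropoints.to(torch.float16).repeat_interleave(group_size, dim=1)
    return (q - z) * s

# Hypothetical comparison against the fast kernel (gemm_forward is only a
# placeholder name for the bound entry point):
# out_ref = inputs @ dequantize_reference(qweight, scales, zeropoints).T
# out_fast = awq_ext.gemm_forward(inputs, packed, scales, zeropoints)
# print((out_fast - out_ref).abs().max() / out_ref.abs().max())
```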
The outputs are not perfectly aligned because the kernels perform the operations in a slightly different order (and I think this is lossy). I find the errors a bit high, so this will require some investigation.
I have also written a small script to test how long it takes to repack a whole model:
- it takes about one minute to repack Llama-7b using the legacy AWQ packing function (on CPU, since it is numpy-based),
- luckily, I had already rewritten it as an exercise, and my version takes only 0.13 s on an A10.
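As a note on the measurement itself, GPU timings need CUDA events (or an explicit synchronization); the snippet below is just a timing harness with a no-op stand-in where the real repacking function would go:

```python
import torch

def repack(qweight: torch.Tensor) -> torch.Tensor:
    # No-op stand-in so the snippet runs as-is; the real tensorized
    # AWQ repacking would go here.
    return qweight

# A handful of Llama-7b-sized weight tensors, just for the timing harness.
qweights = [torch.randint(0, 16, (4096, 4096), dtype=torch.uint8, device="cuda")
            for _ in range(8)]

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for q in qweights:
    repack(q)
end.record()
torch.cuda.synchronize()
print(f"repacking took {start.elapsed_time(end) / 1000:.3f} s")
```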
I only had a couple of hours to spend on this, but I confirmed that the root cause of the mismatched outputs is the order of operations. It is not something that can easily be changed: by design, the dequantizer coming from FasterTransformer assumes a float zeropoint, which is why the dequantization formula is (data * scale) - scaled_zeropoint instead of (data - zeropoint) * scale.
It is odd that AWQ used this dequantization formula, because they actually round the zeropoint before scaling it. This saves one byte per group when serializing, but the accuracy is lower...
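To make the two formulas concrete, here is a toy float16 sketch of both orderings (illustrative only, not the kernel code):

```python
import torch

q = torch.randint(0, 16, (128,)).to(torch.float16)   # unpacked int4 data
scale = torch.tensor(0.0137, dtype=torch.float16)
zeropoint = torch.tensor(7.0, dtype=torch.float16)

# quanto-style affine dequantization: subtract the zeropoint, then scale
ref = (q - zeropoint) * scale

# FasterTransformer-style: pre-compute a scaled float zeropoint, then subtract
scaled_zeropoint = zeropoint * scale
alt = q * scale - scaled_zeropoint

# The two are algebraically equal and differ only by float16 rounding of the
# intermediate products; rounding the zeropoint to an integer before scaling
# (as AWQ does to store it in 4 bits) is an additional source of error
# compared to keeping a float zeropoint.
print((ref - alt).abs().max())
```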
After more tests this weekend, it turns out the issue does not come from the order of operations but rather from a bug inside the kernels.
Problem solved: there is no bug in the kernels, but the data, scales and zeropoint buffers must be contiguous. That was pretty obvious, but I had not enforced it in my tests.
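In practice that just means calling .contiguous() on every buffer before handing it to the extension (gemm_forward is still a placeholder name for the bound kernel):

```python
# Hypothetical call site: the kernel works on raw data pointers, so every
# tensor passed down must be contiguous.
out = awq_ext.gemm_forward(
    inputs.contiguous(),
    packed_weights.contiguous(),
    scales.contiguous(),
    zeropoints.contiguous(),
)
```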
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Merged in #198