Add CUDA kernels for Wint4Afloat16
There are many kernels available to efficiently perform matrix multiplication using packed int4 weights and float16 inputs.
The goal of this issue is to select some of them and add them to a quanto cuda extension (that will be alongside the cpp and mps extensions under library/ext).
A promising candidate is the pair of kernels recently implemented in AWQ: https://github.com/mit-han-lab/llm-awq/pull/142.
These kernels are directly inspired by TensorRT-LLM and seem very fast, while not requiring any heavy dependencies.
The main difficulty would be to support a new packing scheme, hence a new flavor of PackedTensor: for efficiency, the packed weights are interleaved.
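For reference, here is a minimal sketch of plain int4 packing (two values per byte) and its inverse; this is not the AWQ layout itself, which adds an interleaved ordering on top of it, but it is the baseline scheme the new flavor would extend:

```python
import torch

def pack_int4(values: torch.Tensor) -> torch.Tensor:
    """Pack pairs of unsigned int4 values (0..15) into uint8, two per byte.

    This is the straightforward layout; the AWQ kernels expect an extra
    interleaving on top of this, which is the new packing flavor
    discussed in this issue.
    """
    assert values.shape[-1] % 2 == 0
    values = values.to(torch.uint8)
    low = values[..., 0::2]
    high = values[..., 1::2]
    return low | (high << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4, recovering the original unsigned int4 values."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)

weights = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
assert torch.equal(unpack_int4(pack_int4(weights)), weights)
```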
cc @younesbelkada @SunMarc
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I have made some progress on this (branch):
- I verified that I could replicate AWQ packing format,
- I added a method to unpack AWQ packed tensors (required for unit tests and fallbacks),
- I verified that AWQ quantization is equivalent to quanto affine group-wise quantization,
- I built a tiny extension compiled and loaded on-demand with just the gemm/gemv fast kernels (see the sketch after this list),
- I verified I could call gemm and get similar results to a python fallback.
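For context, the on-demand compilation can rely on torch's JIT extension loader; the source file names below are placeholders, not the actual files in the branch:

```python
from torch.utils.cpp_extension import load

# Compile and load the CUDA sources the first time they are needed.
# "awq_ext.cpp" / "awq_gemm.cu" are placeholder names, not the actual
# sources used in the quanto branch.
awq_ext = load(
    name="quanto_awq_ext",
    sources=["awq_ext.cpp", "awq_gemm.cu"],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

# The loaded module exposes whatever functions the C++ bindings declare,
# e.g. awq_ext.gemm_forward(...) for the gemm kernel.
```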
I still need to:
- integrate all this into an end-to-end equivalence comparison with the current quanto dequantized fp16/int4 mm,
- introduce the new packing format and kernels into quanto.
For the second point I am not decided yet whether the packing should happen only when moving the model to a CUDA device or always as a default (but this complexifies the unpacking).
I have successfully integrated the kernels with quanto tensors:
- quantize weights with quanto,
- repack the data,
- pass repacked data, scale and zeropoint to the kernels,
- verify the outputs match torch.matmul with quanto weights (compared as sketched below).
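A rough sketch of that comparison, assuming unpacked int4 data with per-group float16 scales and zeropoints; the fast-kernel call is left as a comment because the exact binding name is specific to the branch:

```python
import torch

def dequantize_reference(qweight, scales, zeropoints, group_size=128):
    """Group-wise affine dequantization: (data - zeropoint) * scale.

    qweight:    (out_features, in_features) unpacked int4 values
    scales:     (out_features, in_features // group_size) float16
    zeropoints: (out_features, in_features // group_size)
    """
    q = qweight.to(torch.float16)
    s = scales.repeat_interleave(group_size, dim=1)
    z = zeropoints.to(torch.float16).repeat_interleave(group_size, dim=1)
    return (q - z) * s

# Hypothetical comparison against the fast kernel (gemm_forward is only a
# placeholder name for the bound entry point):
# out_ref = inputs @ dequantize_reference(qweight, scales, zeropoints).T
# out_fast = awq_ext.gemm_forward(inputs, packed, scales, zeropoints)
# print((out_fast - out_ref).abs().max() / out_ref.abs().max())
```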
The outputs are not perfectly aligned because the kernels perform the operations in a slightly different order (and I think this is lossy). I find the errors a bit high, so this will require some investigation.
I have also written a small script to test how long it takes to repack a whole model:
- it takes about one minute to repack Llama-7b using the legacy AWQ packing function (on CPU, since it is numpy-based),
- luckily, I had already rewritten it as an exercise, and my version takes only 0.13 s on an A10.
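As a note on the measurement itself, GPU timings need CUDA events (or an explicit synchronization); the snippet below is just a timing harness with a no-op stand-in where the real repacking function would go:

```python
import torch

def repack(qweight: torch.Tensor) -> torch.Tensor:
    # No-op stand-in so the snippet runs as-is; the real tensorized
    # AWQ repacking would go here.
    return qweight

# A handful of Llama-7b-sized weight tensors, just for the timing harness.
qweights = [torch.randint(0, 16, (4096, 4096), dtype=torch.uint8, device="cuda")
            for _ in range(8)]

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for q in qweights:
    repack(q)
end.record()
torch.cuda.synchronize()
print(f"repacking took {start.elapsed_time(end) / 1000:.3f} s")
```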
I only had a couple of hours to spend on this, but I confirmed that the root cause of the mismatched outputs is the order of operations. It is not something that can easily be changed: by design, the dequantizer coming from FasterTransformer assumes a float zeropoint, which is why the dequantization formula is (data * scale) - scaled_zeropoint instead of (data - zeropoint) * scale.
It is odd that AWQ used this dequantization formula, because they actually round the zeropoint before scaling it. This saves one byte per group when serializing, but the accuracy is lower...
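To make the two formulas concrete, here is a toy float16 sketch of both orderings (illustrative only, not the kernel code):

```python
import torch

q = torch.randint(0, 16, (128,)).to(torch.float16)   # unpacked int4 data
scale = torch.tensor(0.0137, dtype=torch.float16)
zeropoint = torch.tensor(7.0, dtype=torch.float16)

# quanto-style affine dequantization: subtract the zeropoint, then scale
ref = (q - zeropoint) * scale

# FasterTransformer-style: pre-compute a scaled float zeropoint, then subtract
scaled_zeropoint = zeropoint * scale
alt = q * scale - scaled_zeropoint

# The two are algebraically equal and differ only by float16 rounding of the
# intermediate products; rounding the zeropoint to an integer before scaling
# (as AWQ does to store it in 4 bits) is an additional source of error
# compared to keeping a float zeropoint.
print((ref - alt).abs().max())
```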
After more tests this weekend, it turns out the issue does not come from the order of operations but rather from a bug inside the kernels.
Problem solved: there is no bug in the kernels, but the data, scales and zeropoint buffers must be contiguous. That was pretty obvious, but I had not enforced it in my tests.
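In practice that just means calling .contiguous() on every buffer before handing it to the extension (gemm_forward is still a placeholder name for the bound kernel):

```python
# Hypothetical call site: the kernel works on raw data pointers, so every
# tensor passed down must be contiguous.
out = awq_ext.gemm_forward(
    inputs.contiguous(),
    packed_weights.contiguous(),
    scales.contiguous(),
    zeropoints.contiguous(),
)
```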
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Merged in #198