Integrate marlin fp16/bf16-int4/int8 matrix multiplication kernel

Open dacorvo opened this issue 7 months ago • 5 comments

Since the introduction of mixed-precision fp16-int4 MARLIN (Mixed Auto-Regressive Linear) kernels by IST-DASLab, new mixed-precision MARLIN kernels have been introduced for other data types.

In particular, mixed-precision fp16/bf16-int4/int8 kernels have been contributed to TGI. They could be integrated into optimum-quanto as well, with companion Int8MarlinQBytesTensor and Int4MarlinQBitsTensor classes to pack the weights.
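To illustrate what "packing the weights" means for the int4 case, here is a minimal sketch in pure Python that stores two unsigned 4-bit values per byte. It is only a generic nibble-packing example: the function names `pack_int4` and `unpack_int4` are hypothetical, and the actual MARLIN kernels expect their own permuted weight layout, which is not reproduced here.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first.

    Hypothetical helper for illustration only; this is NOT the MARLIN layout.
    """
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(
        values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2)
    )


def unpack_int4(packed):
    """Inverse of pack_int4: recover the list of 4-bit values."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble was stored first
        out.append(byte >> 4)    # high nibble holds the second value
    return out


weights = [1, 2, 3, 15]
packed = pack_int4(weights)
restored = unpack_int4(packed)
```

Packing halves the storage needed for int4 weights compared with one value per byte; a kernel-specific tensor class would additionally reorder the values to match the memory access pattern of the kernel.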

dacorvo, Jul 12 '24 09:07