quanto
Integrate marlin fp16/bf16-int4/int8 matrix multiplication kernel
Since the introduction of mixed-precision fp16-int4 MARLIN (Mixed Auto-Regressive Linear) kernels by IST-DASLab, new mixed-precision MARLIN kernels have been introduced for other data types.
In particular, mixed-precision fp16/bf16-int4/int8 kernels have been contributed to TGI and could be integrated into optimum-quanto as well, with companion Int8MarlinQBytesTensor and Int4MarlinQBitsTensor to pack the weights.
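
For readers unfamiliar with what "packing the weights" involves, here is a minimal pure-Python sketch of the general idea behind an fp16-int4 matrix multiplication: two 4-bit weights are stored per byte, then unpacked and rescaled on the fly while multiplying with floating-point activations. This is only an illustration under simplifying assumptions. It does not reproduce the actual MARLIN memory layout (which is interleaved for GPU tensor cores) and it uses a single scale per row with a fixed zero-point of 8. The helper names `pack_int4`, `unpack_int4` and `dequant_matvec` are hypothetical and are not part of optimum-quanto or TGI.

```python
from typing import List, Sequence


def pack_int4(values: Sequence[int]) -> bytes:
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: recover the 4-bit values in their original order."""
    out: List[int] = []
    for b in packed:
        out.append(b & 0x0F)  # low nibble was stored first
        out.append(b >> 4)    # high nibble second
    return out


def dequant_matvec(
    x: Sequence[float],
    packed_rows: Sequence[bytes],
    scales: Sequence[float],
    zero_point: int = 8,  # assumption: symmetric int4 stored with an offset of 8
) -> List[float]:
    """Compute y[i] = scales[i] * sum_j (q[i][j] - zero_point) * x[j]."""
    y: List[float] = []
    for packed, scale in zip(packed_rows, scales):
        q = unpack_int4(packed)
        acc = sum((qj - zero_point) * xj for qj, xj in zip(q, x))
        y.append(scale * acc)
    return y


# Two rows of four 4-bit weights each, stored in two bytes per row.
weights = [[8, 9, 7, 10], [0, 15, 8, 8]]
packed = [pack_int4(row) for row in weights]
x = [1.0, 2.0, 3.0, 4.0]
y = dequant_matvec(x, packed, scales=[0.5, 0.25])
print(y)  # [3.5, 1.5]
```

A real kernel fuses the unpacking and rescaling into the matrix multiplication itself, so the weights stay in their packed form in memory. That is the reason a dedicated packed tensor type is needed alongside the kernel.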