AMDMIGraphX
Efficiently support mixed precision w8a16 models
Generalize the work done to efficiently support w4a16 so that it also covers w8a16: weights in an 8-bit format (e.g. int8, eventually fp8) and activations in fp16 (eventually also bf16).
"Efficiently" here means that the weights will stay as int8, and they will be dequantized only in the context of the kernel that will use them; just like how it works for int4 weights.
What's the issue here? Don't the 8-bit weights use a DQ like the int4 ones? Or is it doing something different that we need to handle?