AMDMIGraphX
Efficiently support mixed precision w8a16 models
Generalize the work done to efficiently support w4a16 so that it also covers w8a16: weights in an 8-bit format (e.g. int8, eventually fp8) and activations in fp16 (eventually also bf16).
"Efficiently" here means that the weights will stay as int8, and they will be dequantized only in the context of the kernel that will use them; just like how it works for int4 weights.
What's the issue here? Don't the 8-bit weights use a DQ like the int4 ones? Or is it doing something different that we need to handle?