quanto
Support for FP8 Matmuls
Int8 matrix multiplication kernels are currently called on CUDA and CPU devices when activations and weights are quantized to int8. However, FP8 matmuls are not used when activations and weights are quantized to float8; if I am not mistaken, the matmul is performed in full precision in that case. What is the current situation and roadmap for using float8 matrix multiplications, for instance through _scaled_mm?
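For context, here is a minimal sketch of what a float8 matmul through `torch._scaled_mm` could look like. This is only an illustration, not quanto's implementation: the scaling scheme is a deliberately naive per-tensor one, and `torch._scaled_mm` is a private PyTorch API whose exact signature, return value, and operand-layout constraints have changed between releases and which requires a GPU with FP8 tensor cores (compute capability >= 8.9, e.g. Ada or Hopper).

```python
import torch

device = "cuda"
m, k, n = 64, 64, 64  # _scaled_mm typically requires dimensions divisible by 16

# Start from higher-precision activations/weights and quantize them to float8
# with per-tensor scales (naive absmax scaling, for illustration only).
a_fp16 = torch.randn(m, k, device=device, dtype=torch.float16)
b_fp16 = torch.randn(n, k, device=device, dtype=torch.float16)

fp8_max = torch.finfo(torch.float8_e4m3fn).max
scale_a = (a_fp16.abs().max() / fp8_max).float()  # float32 scalar tensors
scale_b = (b_fp16.abs().max() / fp8_max).float()

a_fp8 = (a_fp16 / scale_a).to(torch.float8_e4m3fn)       # row-major operand
b_fp8 = (b_fp16 / scale_b).to(torch.float8_e4m3fn).t()   # column-major operand

# The FP8 matmul itself: the scales dequantize the float8 inputs and the
# result is produced in out_dtype. Note that in older PyTorch versions this
# call returned a (output, amax) tuple instead of a single tensor.
out = torch._scaled_mm(
    a_fp8,
    b_fp8,
    scale_a=scale_a,
    scale_b=scale_b,
    out_dtype=torch.float16,
)
print(out.shape)  # (m, n)
```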