
INT8 / UINT8 for Quantization

Open conceptofmind opened this issue 2 years ago • 7 comments

Hello,

Is there a proper way to handle INT8/UINT8 for quantization? I am attempting to reproduce the functions below in order to quantize flash-attention with Triton.

import numpy as np

def quantize_to_int8(tensor, clip_max, quant_range=127):
    # Map [-clip_max, clip_max] onto the symmetric int8 range [-127, 127].
    scale = quant_range / clip_max
    min_bound = -quant_range
    max_bound = quant_range
    outputs = np.clip((tensor.astype(np.float32) * scale).round(), min_bound, max_bound)
    quant_tensor = outputs.astype(np.int8)
    return quant_tensor

def quantize_to_uint8(tensor, clip_max, quant_range=255):
    # Map [0, clip_max] onto the unsigned uint8 range [0, 255].
    scale = quant_range / clip_max
    max_bound = quant_range
    outputs = np.clip((tensor.astype(np.float32) * scale).round(), 0, max_bound)
    quant_tensor = outputs.astype(np.uint8)
    return quant_tensor
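
For reference, this is the kind of Triton kernel I am trying to write for the int8 case (a rough sketch; quantize_int8_kernel and the wrapper are my own naming, and it assumes the .to(tl.int8) conversion and int8 stores work, which is part of what I am asking about):

import torch
import triton
import triton.language as tl

@triton.jit
def quantize_int8_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = x * scale
    # Round to nearest, half away from zero. Note np.round rounds half to
    # even, so exact ties may differ by one from the NumPy reference.
    y = tl.where(y >= 0, y + 0.5, y - 0.5)
    # Clamp to the symmetric int8 range [-127, 127] before the cast truncates.
    y = tl.minimum(tl.maximum(y, -127.0), 127.0)
    tl.store(out_ptr + offsets, y.to(tl.int8), mask=mask)

def quantize_to_int8_triton(x, clip_max, quant_range=127):
    # x is assumed to be a contiguous float CUDA tensor.
    out = torch.empty_like(x, dtype=torch.int8)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
    quantize_int8_kernel[grid](x, out, quant_range / clip_max, n, BLOCK_SIZE=1024)
    return out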

Any advice would be greatly appreciated.

Thank you,

Enrico

conceptofmind avatar Feb 13 '23 05:02 conceptofmind

@yuguo68

Jokeren avatar Feb 13 '23 15:02 Jokeren

We used to have a fast code path for this that disappeared when we simplified the IR. We plan to extend the ExternElementwiseOp so it can accommodate these cases better.

ptillet avatar Feb 13 '23 22:02 ptillet

We had a PR on the legacy backend for int8/4/2 dequantization. Please take a look at https://github.com/openai/triton/pull/759 and the examples in test_dequantize.py. We are migrating to triton-MLIR; after the migration is complete, we will start working on quantization/dequantization ops.
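
Roughly, such a dequantize op unpacks low-bit values from a packed integer buffer and rescales them. A hand-rolled sketch of the int4 case in plain Triton (this is not the fused op from #759; the byte-packing layout, zero point, and names here are illustrative):

import triton
import triton.language as tl

@triton.jit
def dequantize_int4_kernel(packed_ptr, out_ptr, scale, n_packed, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_packed
    # Each uint8 holds two 4-bit values: low nibble first, high nibble second.
    byte = tl.load(packed_ptr + offs, mask=mask)
    lo = (byte & 0x0F).to(tl.float32)
    hi = ((byte >> 4) & 0x0F).to(tl.float32)
    # Shift the unsigned [0, 15] nibbles to a signed [-8, 7] range, then rescale.
    lo = (lo - 8.0) * scale
    hi = (hi - 8.0) * scale
    tl.store(out_ptr + 2 * offs, lo, mask=mask)
    tl.store(out_ptr + 2 * offs + 1, hi, mask=mask)

A production version would fuse this unpacking into the consuming kernel (e.g. the matmul load path) rather than materializing the full float tensor.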

yuguo68 avatar Feb 14 '23 01:02 yuguo68

@ptillet @yuguo68 Thank you for the additional information. I will review #759 and the examples in test_dequantize.py. Support for Triton int8 and uint8 dtype conversions would be greatly beneficial.

conceptofmind avatar Feb 14 '23 23:02 conceptofmind

Hi, I'm interested in implementing this. Would you be able to provide any guidance?

jon-chuang avatar Feb 20 '23 16:02 jon-chuang

Any updates on this issue? It seems none of those low-bitwidth kernels work in the current flow.

chhzh123 avatar May 24 '23 21:05 chhzh123

Hi @yuguo68, is this still planned now that we are on the new MLIR backend? Or is there already an alternative way to do things like weight-only (de)quantization efficiently with Triton kernels? Thanks for your help here!

tiwargau avatar Jan 16 '24 18:01 tiwargau

Following up on this issue as well. Thanks!

jpilaul avatar Apr 02 '24 15:04 jpilaul