
How to quantize attn_score = Q*K^T in ViT's SelfAttention

Open alphaRGB opened this issue 2 years ago • 3 comments

Thanks for your great work on int8 quantization for ViT. I have some questions about the quantization of ViT's SelfAttention. In transformer attention: 1) attn_score = Q * K^T, 2) out = attn_prob * V. Looking at the quantization of 1) and 2): do matmul_q_input_quantizer and matmul_k_input_quantizer belong to dynamic quantization, i.e. are the scale and zero_point obtained during inference (on the fly) rather than from calibration?

post training dynamic quantization: https://pytorch.org/docs/stable/quantization.html?highlight=quantization#module-torch.quantization
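To make the distinction in the question concrete, here is a toy sketch (pure Python, hypothetical helper names not taken from FasterTransformer or PyTorch) contrasting dynamic quantization, where the scale is recomputed from each live tensor, with static calibration-based quantization, where the scale is frozen after a calibration pass:

```python
def quantize_int8(xs, amax):
    """Symmetric int8 fake-quantization over the range [-amax, amax]."""
    scale = amax / 127.0
    return [max(-127, min(127, round(x / scale))) * scale for x in xs]

def dynamic_quantize(xs):
    # Dynamic: the scale is recomputed from the live tensor on every call.
    amax = max(abs(x) for x in xs)
    return quantize_int8(xs, amax)

def static_quantize(xs, calibrated_amax):
    # Static: the scale comes from a calibration run and never changes.
    return quantize_int8(xs, calibrated_amax)

# Calibration pass: record the max absolute activation over calibration data.
calib_batches = [[0.1, -0.5, 0.9], [1.2, -0.3, 0.4]]
calibrated_amax = max(abs(x) for batch in calib_batches for x in batch)

# Inference reuses the frozen amax, analogous to TensorQuantizer._amax
# after calibration in the ViT example.
print(static_quantize([0.6, -1.2, 0.05], calibrated_amax))
```

The ViT example follows the second pattern, as confirmed in the answer below.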

# https://github.com/NVIDIA/FasterTransformer/blob/43ae78abfaa13a920ac1930c23615fe28c0e9819/examples/pytorch/vit/ViT-quantization/vit_int8.py#L186
class Attention(nn.Module):

    def __init__(self, ...):
        ...
        if QUANT:
            self.matmul_q_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.matmul_k_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.matmul_v_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.matmul_a_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.softmax_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)

    def forward(self, hidden_states):
        ...
        if QUANT:
            # 1) attn_score = Q * K^T, both inputs fake-quantized
            attention_scores = torch.matmul(self.matmul_q_input_quantizer(query_layer),
                self.matmul_k_input_quantizer(key_layer.transpose(-1, -2)))
        ...
        if QUANT:
            # 2) out = attn_prob * V, both inputs fake-quantized
            context_layer = torch.matmul(self.matmul_a_input_quantizer(attention_probs),
                self.matmul_v_input_quantizer(value_layer))
      

alphaRGB avatar Aug 01 '22 07:08 alphaRGB

@alphaRGB They belong to calibration, not dynamic quantization. After calibration, the models saved to disk also contain a fixed value for matmul_(q/k/v)_input_quantizer._amax. At inference time the saved models are loaded, and those values stay fixed throughout inference or QAT.
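The calibrate, save, load, infer lifecycle described above can be sketched as follows. FakeQuantizer is a hypothetical stand-in for TensorQuantizer (the real class lives in NVIDIA's pytorch-quantization package), and a plain dict stands in for the checkpoint on disk:

```python
class FakeQuantizer:
    def __init__(self):
        self.amax = None  # analogous to TensorQuantizer._amax

    def calibrate(self, xs):
        # Track the running max-abs value over all calibration batches.
        batch_max = max(abs(x) for x in xs)
        self.amax = batch_max if self.amax is None else max(self.amax, batch_max)

    def __call__(self, xs):
        # At inference the scale derives from the frozen amax,
        # never from the live tensor.
        scale = self.amax / 127.0
        return [max(-127, min(127, round(x / scale))) * scale for x in xs]

q = FakeQuantizer()
for batch in ([0.2, -0.8], [1.5, 0.3]):
    q.calibrate(batch)

# "Save to disk": the calibrated amax travels with the checkpoint.
checkpoint = {"matmul_q_input_quantizer._amax": q.amax}

# "Load": a fresh quantizer restores the fixed amax and reuses it forever.
restored = FakeQuantizer()
restored.amax = checkpoint["matmul_q_input_quantizer._amax"]
print(restored([3.0, -0.1]))  # 3.0 clips to the calibrated amax of 1.5
```

Because amax is restored from the checkpoint rather than recomputed, every inference run quantizes the same input identically, which is what distinguishes this from dynamic quantization.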

Njuapp avatar Aug 01 '22 07:08 Njuapp

@Njuapp thanks for your quick response ^^. As you said, the quantization params (scale, zp) of Q, K, V are obtained from calibration. Does that mean there is no difference between the attention activations (Q, K, V) and other activations (like the nn.Linear inputs of the FFN) during quantization, i.e. they are quantized with the same method?

alphaRGB avatar Aug 01 '22 08:08 alphaRGB

@alphaRGB Yes, there is no difference between the attention activations (Q, K, V) and other activations (like the nn.Linear inputs of the FFN) during quantization.

Njuapp avatar Aug 01 '22 08:08 Njuapp

Closing this issue because it is inactive. Feel free to re-open it if you still have any problem.

byshiue avatar Sep 08 '22 07:09 byshiue