FasterTransformer
How to quantize attn_score = Q*K in ViT's SelfAttention
Thanks for your great work on int8 quantization for ViT. I have a question about the quantization of ViT's SelfAttention. The transformer attention contains two matmuls: 1) attn_score = Q * K^T and 2) out = attn_prob * V. Looking at how 1) and 2) are quantized, do matmul_q_input_quantizer and self.matmul_k_input_quantizer belong to dynamic quantization, i.e. are their scale and zero_point obtained on the fly during inference rather than from calibration?
For reference, post-training dynamic quantization: https://pytorch.org/docs/stable/quantization.html?highlight=quantization#module-torch.quantization
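For context, this is roughly what post-training dynamic quantization means in PyTorch: only the weights are quantized ahead of time, while activation scales are computed per batch at inference. A minimal sketch with a made-up toy model (not from the ViT code):

import torch
import torch.nn as nn

# Toy model used only to illustrate dynamic quantization.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# Weights are quantized once; activation scale/zero_point are computed on the fly.
dq_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(dq_model(torch.randn(1, 64)).shape)

In contrast, the ViT code quoted below wraps the attention matmul inputs in TensorQuantizer modules: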
# https://github.com/NVIDIA/FasterTransformer/blob/43ae78abfaa13a920ac1930c23615fe28c0e9819/examples/pytorch/vit/ViT-quantization/vit_int8.py#L186
import torch
import torch.nn as nn
from pytorch_quantization.nn import QuantLinear, TensorQuantizer

class Attention(nn.Module):
    def __init__(self):
        ...
        if QUANT:
            # Fake-quantizers for the inputs of the two attention matmuls
            self.matmul_q_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.matmul_k_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.matmul_v_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.matmul_a_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
            self.softmax_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)

    def forward(self, hidden_states):
        ...
        if QUANT:
            # 1) attn_score = Q * K^T, with both operands quantized
            attention_scores = torch.matmul(self.matmul_q_input_quantizer(query_layer),
                                            self.matmul_k_input_quantizer(key_layer.transpose(-1, -2)))
        ...
        if QUANT:
            # 2) out = attn_prob * V, with both operands quantized
            context_layer = torch.matmul(self.matmul_a_input_quantizer(attention_probs),
                                         self.matmul_v_input_quantizer(value_layer))
@alphaRGB They belong to calibration, not dynamic quantization.
After calibration, the calibrated model saved to disk also contains a fixed value for each matmul_(q/k/v)_input_quantizer._amax.
When doing inference, the saved model is loaded and those values stay fixed, whether you run inference or QAT.
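A minimal sketch of that calibration flow with the pytorch-quantization toolkit; `model`, `calib_loader`, and `num_calib_batches` are placeholders, and this mirrors the usual toolkit pattern rather than the exact script in the repo:

import torch
from pytorch_quantization import calib
from pytorch_quantization.nn import TensorQuantizer

def calibrate_model(model, calib_loader, num_calib_batches=10):
    # 1) Put every TensorQuantizer into calibration mode: collect statistics,
    #    but do not quantize yet.
    for module in model.modules():
        if isinstance(module, TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # 2) Run a few batches so the calibrators see representative activations.
    with torch.no_grad():
        for i, (images, _) in enumerate(calib_loader):
            model(images)
            if i + 1 >= num_calib_batches:
                break

    # 3) Turn the collected statistics into a fixed _amax per quantizer and
    #    switch quantization back on. These _amax buffers are what end up in
    #    the saved checkpoint and stay constant at inference / QAT time.
    for module in model.modules():
        if isinstance(module, TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax("percentile", percentile=99.99)
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()
    return model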
@Njuapp Thanks for your quick response ^^. As you said, the quantization params (scale, zero_point) of Q, K, V are obtained from calibration. Does that mean there is no difference between the attention activations (Q, K, V) and other activations (like the inputs of nn.Linear in the FFN) during quantization, i.e. they are quantized with the same method?
@alphaRGB Yes, there is no difference between the attention activations (Q, K, V) and other activations (like the inputs of nn.Linear in the FFN) during quantization.
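To make the "same method" point concrete, a small illustrative snippet (not from the repo): the attention matmul quantizers are built from QuantLinear.default_quant_desc_input, the same per-tensor int8 input descriptor that QuantLinear uses for its own input; the 768/3072 sizes are just ViT-B-like example dimensions:

from pytorch_quantization.nn import QuantLinear, TensorQuantizer

# The same input descriptor (per-tensor int8, amax fixed by calibration) is used for both:
desc = QuantLinear.default_quant_desc_input

ffn_fc1 = QuantLinear(768, 3072)            # FFN linear: quantizes its input with `desc`
attn_q_quantizer = TensorQuantizer(desc)    # attention: quantizes Q before Q @ K^T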
Closing this issue because it is inactive. Feel free to re-open it if you still have any problem.