Cheng Hang
> Hi,
>
> Our quantization scheme is:
>
> 1. FC Weights are quantized to Int8
> 2. FC Biases are quantized to Int32
> 3. Everything else is...
P.S. I ran inference on CPU; the CPU is an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.
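For concreteness, here is a minimal PyTorch sketch of such a scheme. Per-tensor symmetric scaling and the helper name `quantize_fc` are assumptions for illustration, not necessarily the exact scheme described above:

```python
import torch

def quantize_fc(weight: torch.Tensor, bias: torch.Tensor, act_scale: float = 1.0):
    """Illustrative per-tensor symmetric quantization of one FC layer:
    weights -> int8, bias -> int32.

    The bias is quantized with scale = weight_scale * act_scale so it can be
    added directly to the int32 accumulator of the int8 matmul.
    """
    w_scale = weight.abs().max() / 127.0                      # symmetric int8 range
    w_q = torch.clamp(torch.round(weight / w_scale), -127, 127).to(torch.int8)

    b_scale = w_scale * act_scale                             # int32 bias scale
    b_q = torch.round(bias / b_scale).to(torch.int32)
    return w_q, b_q, w_scale, b_scale
```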
@alphaRGB They belong to calibration, not dynamic quantization. After calibration, the calibrated models saved to disk also contain the fixed value of `matmul_(q/k/v)_input_quantizer._amax`. When doing inference,...
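For context, a minimal sketch of the usual pytorch-quantization max-calibration loop that collects and then freezes these `_amax` values (the function name, data-loader shape, batch count, and save path are illustrative assumptions):

```python
import torch
from pytorch_quantization import nn as quant_nn

def calibrate_amax(model, data_loader, num_batches=16):
    """Run max calibration so every TensorQuantizer (including the
    matmul_q/k/v_input_quantizer modules) records a fixed _amax value."""
    # Switch quantizers to calibration mode: collect statistics, no fake-quant.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # Feed representative data through the model.
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images)
            if i + 1 >= num_batches:
                break

    # Freeze the collected amax values and re-enable quantization.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.load_calib_amax()
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

    # The fixed amax values are part of the state dict and are saved with the model.
    torch.save(model.state_dict(), "calibrated_model.pth")
```

After this, the `_amax` buffers are constants at inference time, which is what distinguishes calibration from dynamic quantization.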
@alphaRGB Yes, there is no difference between the attention activations (Q, K, V) and other activations (e.g., the nn.Linear layers of the FFN) during quantization.
Have you tried quant_mode=ft2? That should be faster, and the speedup results listed in vit_guide.md are `ft2` results.
@edric1261234 You can check whether this PR helps: https://github.com/pytorch/TensorRT/pull/1111
Hi @jianfei-wangg, sorry to say that I could not reproduce your results. Are you using the current main? I cannot run with `--g=32` or `--g=64`; only `--g=128` or larger...
@jianfei-wangg I modified the 55th example as you described, but the result shows that fused_mixed_input_gemm costs 358.8 us, which is less than unfused_dequant (1003 us) + normal FP8 gemm (276 ...