Cheng Hang
> Hi,
>
> Our quantization scheme is:
>
> 1. FC Weights are quantized to Int8
> 2. FC Biases are quantized to Int32
> 3. Everything else is...
P.S. I ran inference on CPU; the CPU is an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.
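For concreteness, here is a minimal PyTorch sketch of such a scheme. Per-tensor symmetric scaling and the helper name `quantize_fc` are assumptions for illustration, not necessarily the exact scheme described above:

```python
import torch

def quantize_fc(weight: torch.Tensor, bias: torch.Tensor, act_scale: float = 1.0):
    """Illustrative per-tensor symmetric quantization of one FC layer:
    weights -> int8, bias -> int32.

    The bias is quantized with scale = weight_scale * act_scale so it can be
    added directly to the int32 accumulator of the int8 matmul.
    """
    w_scale = weight.abs().max() / 127.0                      # symmetric int8 range
    w_q = torch.clamp(torch.round(weight / w_scale), -127, 127).to(torch.int8)

    b_scale = w_scale * act_scale                             # int32 bias scale
    b_q = torch.round(bias / b_scale).to(torch.int32)
    return w_q, b_q, w_scale, b_scale
```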
@alphaRGB They belong to calibration, not dynamic quantization. After calibration, the calibrated models saved to disk also contain the fixed value of `matmul_(q/k/v)_input_quantizer._amax`. When doing inference,...
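For context, a minimal sketch of the usual pytorch-quantization max-calibration loop that collects and then freezes these `_amax` values (the function name, data-loader shape, batch count, and save path are illustrative assumptions):

```python
import torch
from pytorch_quantization import nn as quant_nn

def calibrate_amax(model, data_loader, num_batches=16):
    """Run max calibration so every TensorQuantizer (including the
    matmul_q/k/v_input_quantizer modules) records a fixed _amax value."""
    # Switch quantizers to calibration mode: collect statistics, no fake-quant.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # Feed representative data through the model.
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images)
            if i + 1 >= num_batches:
                break

    # Freeze the collected amax values and re-enable quantization.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.load_calib_amax()
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

    # The fixed amax values are part of the state dict and are saved with the model.
    torch.save(model.state_dict(), "calibrated_model.pth")
```

After this, the `_amax` buffers are constants at inference time, which is what distinguishes calibration from dynamic quantization.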
@alphaRGB Yes, there is no difference between the attention activations (Q, K, V) and other activations (e.g., the nn.Linear layers of the FFN) during quantization.
Have you tried quant_mode=ft2? That should be faster, and the speedup results listed in vit_guide.md are `ft2` results.
@edric1261234 You can check whether this PR helps: https://github.com/pytorch/TensorRT/pull/1111
Hi @jianfei-wangg, sorry to say that I could not reproduce your results. Are you using the current main? I cannot run with `--g=32` or `--g=64`; only `--g=128` or larger...
@jianfei-wangg I modified the 55th example as you described, but the result shows that fused_mixed_input_gemm costs 358.8 us, which is less than unfused_dequant (1003 us) + normal FP8 gemm (276 ...