Po-Han Huang (NVIDIA)
@ecilay Could you share the ONNX model that you have exported?
> Also when I add argument do_constant_folding=True to above onnx conversion, the conversion to trt won't work it will...
Could you try the TensorRT 9.2 release? https://github.com/NVIDIA/TensorRT/tree/release/9.2#setting-up-the-build-environment We have significantly relaxed the MHA pattern-matching constraints between 8.6 and 9.2. Thanks
You can follow these commands to install TRT 9.2 in Triton Inference Server container: https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile#L92-L95
@zerollzeng Could you file an internal tracker for this if we can repro it? I think the problem is that the GEMM size is too small (only 3x4 * 4x1) and is...
It seems that Torch-TRT is using the default stream to call the TRT engine, which is not recommended. Let me ask the Torch-TRT team internally about this issue. Meanwhile, could you work around...
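As a rough sketch of the workaround being suggested (launching inference on a non-default CUDA stream), something like the following should work. Note that `model`, `run_on_side_stream`, and the `nn.Linear` stand-in are all hypothetical placeholders for illustration, not part of the original thread; in practice the module would be the Torch-TRT compiled module.

```python
import torch
import torch.nn as nn

def run_on_side_stream(model, x):
    # CPU fallback so the sketch stays runnable without a GPU
    if not torch.cuda.is_available():
        return model(x)
    device = torch.device("cuda")
    model = model.to(device)
    x = x.to(device)
    stream = torch.cuda.Stream()       # non-default CUDA stream
    with torch.cuda.stream(stream):    # enqueue the forward pass on it
        y = model(x)
    stream.synchronize()               # wait for the stream to finish
    return y

model = nn.Linear(4, 2)
out = run_on_side_stream(model, torch.randn(1, 4))
```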
There is a hack that should make TRT much faster for this bmm. Instead of doing `torch.bmm(param_map, vecs)`, do this instead: ``` return torch.sum(param_map * vecs.view(n, 1, 4), dim=2) ```
Could you try removing the Q/DQ ops before BiasAdd? Those are not needed and may break MatMul+bias fusion. If the performance is still worse than FP16 after removing those Q/DQs,...
Some findings:
1. fc_qkv and fc_aout both run in high precision. To get a speedup, Q/DQ ops should be inserted before these MatMul ops.
2. ...
Hmm, I can't see any CUDA kernels in the [MHA Quantized Nsys Report](https://drive.google.com/file/d/1PooFXmoAoTV1AeFizCA5qMhjE2sEDQ_n/view?usp=sharing). Maybe it is caused by the CUDA failure you saw? Does this error only happen when you...
For BERT, we were able to see a ~20% performance difference between INT8 and FP16.