Po-Han Huang (NVIDIA)

229 comments by Po-Han Huang (NVIDIA)

@ecilay Could you share the ONNX model that you have exported?

> Also when I add argument do_constant_folding=True to above onnx conversion, the conversion to trt won't work it will...

Could you try the TensorRT 9.2 release? https://github.com/NVIDIA/TensorRT/tree/release/9.2#setting-up-the-build-environment We have significantly relaxed the MHA pattern-matching constraints between 8.6 and 9.2. Thanks!

You can follow these commands to install TRT 9.2 in Triton Inference Server container: https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile#L92-L95

@zerollzeng Could you file an internal tracker for this if we can repro it? I think the problem is that the GEMM size is too small (only 3x4 * 4x1) and is...

It seems that Torch-TRT is using the default stream to call the TRT engine, which is not recommended. Let me ask the Torch-TRT team internally about this issue. Meanwhile, could you work around...
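As a sketch of the non-default-stream workaround mentioned above, assuming PyTorch: the wrapper name below is hypothetical, and the CPU fallback is only there so the snippet runs anywhere.

```python
import torch

def run_on_side_stream(fn, *args):
    """Run `fn` on a non-default CUDA stream when CUDA is available,
    so it does not serialize against work on the default stream."""
    if not torch.cuda.is_available():
        # CPU fallback: streams are a CUDA concept, so just call it directly.
        return fn(*args)
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        out = fn(*args)
    # Make the default stream wait for the side stream before consuming `out`.
    torch.cuda.current_stream().wait_stream(stream)
    return out

y = run_on_side_stream(torch.relu, torch.tensor([-1.0, 2.0]))
```

The same pattern applies to a TRT engine call: launch it inside a `torch.cuda.stream(...)` context and synchronize the streams before reading the outputs.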

There is a hack that should make TRT much faster for this bmm. Instead of doing `torch.bmm(param_map, vecs)`, do this instead:

```python
return torch.sum(param_map * vecs.view(n, 1, 4), dim=2)
```
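The rewrite above can be checked numerically. A minimal NumPy sketch, assuming `param_map` has shape `(n, m, 4)` and `vecs` has shape `(n, 4, 1)` (the shapes are assumptions, not stated in the comment):

```python
import numpy as np

n, m = 8, 3
rng = np.random.default_rng(0)
param_map = rng.standard_normal((n, m, 4))   # assumed shape (n, m, 4)
vecs = rng.standard_normal((n, 4, 1))        # assumed shape (n, 4, 1)

# Original batched matmul: (n, m, 4) @ (n, 4, 1) -> (n, m, 1)
bmm_out = np.matmul(param_map, vecs)

# Rewrite: broadcast-multiply, then reduce over the length-4 axis -> (n, m)
rewrite_out = np.sum(param_map * vecs.reshape(n, 1, 4), axis=2)

assert np.allclose(bmm_out.squeeze(-1), rewrite_out)
```

Note the rewrite drops the trailing length-1 dimension, so downstream code may need a `squeeze`/`view` to match the bmm output shape.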

Could you try removing the Q/DQ ops before BiasAdd? Those are not needed and may break MatMul+bias fusion. If the performance is still worse than FP16 after removing those Q/DQs,...
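The Q/DQ removal suggested above can be illustrated schematically. This is a toy node-list sketch, not a real ONNX API; in practice you would do this with a graph-editing tool such as onnx-graphsurgeon, and the node names here are made up.

```python
# Toy representation of a linear layer whose bias-add is preceded by Q/DQ ops.
graph = [
    {"op": "MatMul",           "name": "fc/MatMul"},
    {"op": "QuantizeLinear",   "name": "fc/q"},
    {"op": "DequantizeLinear", "name": "fc/dq"},
    {"op": "BiasAdd",          "name": "fc/BiasAdd"},
]

def drop_qdq_before_biasadd(nodes):
    """Remove a QuantizeLinear/DequantizeLinear pair that sits directly
    before a BiasAdd, so the producer feeds BiasAdd and MatMul+bias can fuse."""
    out, i = [], 0
    while i < len(nodes):
        if (i + 2 < len(nodes)
                and nodes[i]["op"] == "QuantizeLinear"
                and nodes[i + 1]["op"] == "DequantizeLinear"
                and nodes[i + 2]["op"] == "BiasAdd"):
            i += 2  # skip the Q/DQ pair; BiasAdd now consumes the producer
            continue
        out.append(nodes[i])
        i += 1
    return out

cleaned = [n["op"] for n in drop_qdq_before_biasadd(graph)]
# cleaned == ["MatMul", "BiasAdd"]
```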

![2023-07-25 10_22_42-Window](https://github.com/NVIDIA/TensorRT/assets/53919306/0b041596-cb5b-4a1a-851a-f7152bd2c76b)
![2023-07-25 10_19_35-Window](https://github.com/NVIDIA/TensorRT/assets/53919306/9930a964-48c8-4870-8dba-42f95eb608e3)
![2023-07-25 10_22_28-Window](https://github.com/NVIDIA/TensorRT/assets/53919306/594524cc-1a11-48f0-96a0-0ee83ce80c76)

Some findings:

1. fc_qkv and fc_aout both run in high precision. To get a speedup, Q/DQ ops should be inserted before these MatMul ops.
2. ...

Hmm, I can't see any CUDA kernels in the [MHA Quantized Nsys Report](https://drive.google.com/file/d/1PooFXmoAoTV1AeFizCA5qMhjE2sEDQ_n/view?usp=sharing). Maybe it is caused by the CUDA failure you saw? Does this error only happen when you...

For BERT, we were able to see a ~20% perf difference between INT8 and FP16.