Po-Han Huang (NVIDIA)
@ecilay Could you share the ONNX model that you have exported?
> Also when I add argument do_constant_folding=True to above onnx conversion, the conversion to trt won't work it will...
Could you try the TensorRT 9.2 release? https://github.com/NVIDIA/TensorRT/tree/release/9.2#setting-up-the-build-environment We have significantly relaxed the MHA pattern-matching constraints between 8.6 and 9.2. Thanks
You can follow these commands to install TRT 9.2 in Triton Inference Server container: https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile#L92-L95
@zerollzeng Could you file an internal tracker for this if we can repro it? I think the problem is that the GEMM size is too small (only 3x4 * 4x1) and is...
It seems that Torch-TRT is using the default stream to call the TRT engine, which is not recommended. Let me ask the Torch-TRT team internally about this issue. Meanwhile, could you work around...
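As a rough sketch of the workaround being suggested (launching inference on a non-default CUDA stream), something like the following should work. Note that `model`, `run_on_side_stream`, and the `nn.Linear` stand-in are all hypothetical placeholders for illustration, not part of the original thread; in practice the module would be the Torch-TRT compiled module.

```python
import torch
import torch.nn as nn

def run_on_side_stream(model, x):
    # CPU fallback so the sketch stays runnable without a GPU
    if not torch.cuda.is_available():
        return model(x)
    device = torch.device("cuda")
    model = model.to(device)
    x = x.to(device)
    stream = torch.cuda.Stream()       # non-default CUDA stream
    with torch.cuda.stream(stream):    # enqueue the forward pass on it
        y = model(x)
    stream.synchronize()               # wait for the stream to finish
    return y

model = nn.Linear(4, 2)
out = run_on_side_stream(model, torch.randn(1, 4))
```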
There is a hack that should make TRT much faster for this bmm. Instead of doing `torch.bmm(param_map, vecs)`, do this instead: ``` return torch.sum(param_map * vecs.view(n, 1, 4), dim=2) ```
Could you try removing the Q/DQ ops before BiasAdd? Those are not needed and may break MatMul+bias fusion. If the performance is still worse than FP16 after removing those Q/DQs,...
Some findings:
1. fc_qkv and fc_aout both run in high precision. To get a speedup, Q/DQ ops should be inserted before these MatMul ops.
2. ...
Hmm, I can't see any CUDA kernels in the [MHA Quantized Nsys Report](https://drive.google.com/file/d/1PooFXmoAoTV1AeFizCA5qMhjE2sEDQ_n/view?usp=sharing). Maybe it is caused by the CUDA failure you saw? Does this error only happen when you...
For BERT, we were able to see a ~20% performance difference between INT8 and FP16.