
In TRT 10.0.1, the setPrecision and setOutputType APIs do not work

Open · 2730gf opened this issue 1 year ago • 8 comments

Description

We have a model that overflows in fp16, so we use per-layer precision constraints to force some layers to run in fp32. This worked in version 8.6 and inference produced normal results, but after upgrading to 10.0.1 the model output overflows. Using Polygraphy, we found that NaN values already appear at the first overflow location. (Are setPrecision and setOutputType being ignored?)
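For reference, a sketch of the kind of Polygraphy command we used to locate the NaN (sample.onnx is a placeholder; flag names as documented by polygraphy run --help):

# Compare TRT fp16 against ONNX Runtime, marking every layer output so the
# first NaN-producing layer is visible; --validate flags NaN/Inf in outputs.
polygraphy run sample.onnx --trt --onnxrt --fp16 --validate \
    --trt-outputs mark all --onnx-outputs mark all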

Environment

TensorRT Version: 10.0.1
NVIDIA GPU: 3090 & 3080
NVIDIA Driver Version: 550
CUDA Version: cuda-12.2

Steps To Reproduce

My code is like this:

for (int32_t layerIdx = 0; layerIdx < network.getNbLayers(); ++layerIdx) {
    auto *layer = network.getLayer(layerIdx);
    auto const layerName = layer->getName();
    nvinfer1::DataType dataType;
    // matchLayerPrecision() decides whether this layer's precision should be constrained
    if (matchLayerPrecision(layerPrecisions, layerName, &dataType)) {
        layer->setPrecision(dataType);  // constrain the layer's compute precision
        int32_t const layerOutNb = layer->getNbOutputs();
        for (int32_t outputIdx = 0; outputIdx < layerOutNb; ++outputIdx) {
            layer->setOutputType(outputIdx, dataType);  // constrain each output tensor's type
        }
    }
}

By the way, I have already set the kOBEY_PRECISION_CONSTRAINTS builder flag: config_->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);

2730gf · Jun 13 '24 10:06

I suggest using trtexec:

trtexec --layerOutputTypes=spec --layerPrecisions=spec --precisionConstraints=spec --fp16 --verbose --onnx=spec

lix19937 · Jun 15 '24 14:06

I have done this, and it works on 8.6, but fails on 10.0.1:

export layer_precision="p2o.Pow.0:fp32,p2o.Pow.2:fp32..."
trtexec  --fp16 --onnx=sample.onnx --precisionConstraints="obey" --layerPrecisions=${layer_precision} --layerOutputTypes=${layer_precision}  --saveEngine=sample.trt
trtexec --loadEngine=sample.trt  --dumpOutput --loadInputs=... 

2730gf · Jun 19 '24 09:06

On TRT 10.0.1, try:

trtexec  --fp16 --onnx=sample.onnx --precisionConstraints="obey" --layerPrecisions=${layer_precision} --layerOutputTypes=${layer_precision}  --saveEngine=sample.trt --builderOptimizationLevel=5
 

lix19937 · Jun 23 '24 10:06

I have added --builderOptimizationLevel=5, but it still overflows.

2730gf · Jun 24 '24 05:06

You can compare the tactics chosen by the two versions.
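For example, build with detailed profiling on each version and diff the exported layer info to see which tactics were picked (a sketch; file names are placeholders, flags per trtexec --help):

# On each TRT version, export per-layer info including the chosen tactics,
# then diff the resulting JSON files (e.g. layers_trt86.json vs layers_trt10.json).
trtexec --fp16 --onnx=sample.onnx --precisionConstraints="obey" \
    --layerPrecisions=${layer_precision} --layerOutputTypes=${layer_precision} \
    --profilingVerbosity=detailed --exportLayerInfo=layers_trt10.json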

lix19937 · Jun 25 '24 01:06

Thank you very much for your reply. After setting builderOptimizationLevel to 5, the cache cannot be generated on TRT 8.6, but it can on TRT 10. On TRT 10 I can see the name of the chosen tactic: sm80_xmma_gemm_f32f32_f32f32_f32_nn_n_tilesize32x32x8_stage3_warpsize1x2x1_ffma_aligna4_alignc4; judging from the name (f32f32_f32f32_f32, ffma), this is already an fp32 kernel? Is there any other way to keep narrowing down the problem?

2730gf · Jun 26 '24 12:06

@2730gf have you also tried a strongly typed network? See:

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#strongly-typed-networks

thanks!
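With trtexec, a strongly typed build would look roughly like this (a sketch; --stronglyTyped per trtexec --help on TRT 10, where precisions come from the tensor types in the ONNX model itself, so --fp16/--layerPrecisions no longer apply):

# Build a strongly typed engine; per-layer precision is taken from the
# model's own tensor types instead of builder precision flags.
trtexec --onnx=sample.onnx --stronglyTyped --saveEngine=sample.trt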

ttyio · Aug 07 '24 05:08

@ttyio Thank you for your reply. I found that after turning on this option, inference runs in bf16 instead of fp16. Compared with fp16 there is no overflow, but there is a large latency penalty. Is there a way to precisely constrain the precision?

2730gf · Aug 13 '24 12:08