torch2trt

I got wrong output when fp16_mode is True

Open marigoold opened this issue 4 years ago • 7 comments

import torch
import torch2trt as ttrt

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([1, 3, 112, 112]).cuda()
trt_net = ttrt.torch2trt(net, [x], max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)
print('without fp16: ', (ori_output - trt_output).max())
print('with fp16: ', (ori_output - trt_fp16_output).max())

And I got different results.

I checked trt_fp16_output and found that most of its values are zero. Is there anything wrong in my code? Looking forward to your reply, thanks!

marigoold avatar Aug 31 '20 08:08 marigoold

I'm using torch==1.4.0 and python==3.7.0

marigoold avatar Aug 31 '20 08:08 marigoold

Hi @marigoold ,

Thanks for reaching out. It looks like you're following the appropriate steps for conversion.

Do you mind running with

import tensorrt as trt

trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True, log_level=trt.Logger.INFO)

And reporting the output log? This may help indicate any internal TensorRT issue.

Also, it may not be important here, but for your output comparisons, I'd make sure to take the maximum absolute difference. There is a slight chance that the FP32 output is actually off, but nearly all of the values in trt_output are greater than the original, which a signed difference would hide. Probably not the case, but just to be safe.

print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())

Best, John

jaybdub avatar Aug 31 '20 22:08 jaybdub

Thanks for your reply! I modified my code according to your suggestion.

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([20,3,112,112]).cuda()
trt_net = ttrt.torch2trt(net, [x],  max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x],  max_batch_size=20, fp16_mode=True,  log_level=trt.Logger.INFO)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)
print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())

And the log is

[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(5.9605e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(1.7316, device='cuda:0', grad_fn=<MaxBackward1>)

I set max_workspace_size=1<<29, then the INFO about workspace memory disappeared, but the result was still wrong.

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([20,3,112,112]).cuda()
trt_net = ttrt.torch2trt(net, [x],  max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x],  max_batch_size=20, fp16_mode=True,  log_level=trt.Logger.INFO, max_workspace_size=1<<29)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)
print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(8.3447e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(1.3758, device='cuda:0', grad_fn=<MaxBackward1>)

Just now, I found that if I use another network (MobileFaceNet), fp16_mode returns a reasonable result. Could it be a problem with the original network structure? I was using ShuffleNet v2 with SE blocks.
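(Not from this thread, but possibly relevant context: a common cause of near-zero or garbage FP16 engine output is an intermediate activation overflowing float16's dynamic range, whose maximum finite value is about 65504. SE-style blocks multiply feature maps by learned scales, which can push pre-activations past that limit even when FP32 inference looks fine. A minimal standalone sketch of the failure mode, plain NumPy rather than TensorRT:)

```python
import numpy as np

# float16 can only represent magnitudes up to ~65504.
fp16_max = np.finfo(np.float16).max  # 65504.0

x32 = np.float32(1e5)   # perfectly fine in float32
x16 = np.float16(x32)   # overflows to inf in float16
print(x32, x16)

# A pre-activation that is harmless in FP32 poisons everything
# downstream once the engine computes it in FP16.
act = np.array([3.0e4, 7.0e4], dtype=np.float32)
print(act.astype(np.float16))  # second entry overflows to inf
```

If that is what is happening, strict_type_constraints or keeping the offending layers in FP32 would be the usual workarounds.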

[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(2.6647e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(0.0007, device='cuda:0', grad_fn=<MaxBackward1>)

marigoold avatar Sep 01 '20 04:09 marigoold

Do you mind sharing the exact models (the mobilenet that succeeds and the one that fails)?

Also, I'm curious, what happens if you do

model_trt = torch2trt(..., strict_type_constraints=True)

jaybdub avatar Sep 02 '20 04:09 jaybdub

Thanks for your kind reply. I set strict_type_constraints=True and got these logs.

[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] WARNING: No implementation obeys reformatting-free rules, at least 15 reformatting nodes are needed, now picking the fastest path instead.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(8.3447e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(1.4315, device='cuda:0', grad_fn=<MaxBackward1>)

Here are my exact models: MobileFaceNet (the one that succeeded) and ShuffleNet v2 with SE blocks (the one that failed). They are modified from https://github.com/TreB1eN/InsightFace_Pytorch/blob/master/model.py and https://github.com/weiaicunzai/pytorch-cifar100/blob/master/models/shufflenetv2.py.

marigoold avatar Sep 02 '20 06:09 marigoold

Hi @jaybdub @marigoold Did you solve this problem? I have the same problem...

sdimantsd avatar Nov 16 '21 17:11 sdimantsd

I used torch2trt fp16 mode to convert two models (same architecture, different weights). One is good and the other is bad... could this be related to the model weights?
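(It could be: if one checkpoint produces much larger intermediate activations, it can overflow float16 (max ~65504) where the other checkpoint doesn't. One hypothetical way to check before converting is to record each layer's peak output magnitude in an FP32 forward pass; any layer approaching that limit is a likely culprit. A toy sketch of the idea with a made-up two-checkpoint MLP in NumPy, not the models from this thread:)

```python
import numpy as np

FP16_MAX = 65504.0  # largest finite float16 value

def layer_peaks(weights, x):
    """Run a toy linear+ReLU stack in float32 and report each layer's
    peak magnitude, flagging layers that get close to float16's limit."""
    report = []
    for i, w in enumerate(weights):
        x = np.maximum(w @ x, 0.0)          # linear + ReLU, float32
        peak = float(np.abs(x).max())
        report.append((i, peak, peak > 0.5 * FP16_MAX))
    return report

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)

# Two hypothetical checkpoints: same shapes, very different scales.
good = [rng.standard_normal((64, 64)).astype(np.float32) * 0.1 for _ in range(3)]
bad = [w * 100.0 for w in good]

print(layer_peaks(good, x))  # small peaks everywhere: FP16-safe
print(layer_peaks(bad, x))   # peaks grow layer by layer and get flagged
```

The same measurement on a real PyTorch model could be done with forward hooks; a flagged layer would explain why one set of weights survives FP16 and the other doesn't.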

itachi1232gg avatar Aug 21 '23 07:08 itachi1232gg