torch2trt

I got wrong output when fp16_mode is True

Open marigoold opened this issue 4 years ago • 7 comments

import torch
import torch2trt as ttrt

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([1, 3, 112, 112]).cuda()
trt_net = ttrt.torch2trt(net, [x], max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)
print('without fp16: ', (ori_output - trt_output).max())
print('with fp16: ', (ori_output - trt_fp16_output).max())

And I got different results.

I checked trt_fp16_output and found that most of its values are zero. Is there anything wrong in my code? Looking forward to your reply, thanks!

marigoold avatar Aug 31 '20 08:08 marigoold

I'm using torch==1.4.0 and python==3.7.0

marigoold avatar Aug 31 '20 08:08 marigoold

Hi @marigoold ,

Thanks for reaching out. It looks like you're following the appropriate steps for conversion.

Do you mind running with

import tensorrt as trt

trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True, log_level=trt.Logger.INFO)

And reporting the output log? This may help indicate any internal TensorRT issue.

Also, it may not be important here, but for your output comparisons, I'd make sure to take the maximum absolute difference. There is a slight chance that the FP32 output is actually off, but nearly all of the values in trt_output are greater than the original, which a signed difference would hide. Probably not the case, but just to be safe.

print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())

Best, John

jaybdub avatar Aug 31 '20 22:08 jaybdub

Thanks for your reply! I modified my code according to your suggestion.

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([20,3,112,112]).cuda()
trt_net = ttrt.torch2trt(net, [x],  max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x],  max_batch_size=20, fp16_mode=True,  log_level=trt.Logger.INFO)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)
print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())

And the log is

[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(5.9605e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(1.7316, device='cuda:0', grad_fn=<MaxBackward1>)

I set max_workspace_size=1<<29, then the INFO about workspace memory disappeared, but the result was still wrong.

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([20,3,112,112]).cuda()
trt_net = ttrt.torch2trt(net, [x],  max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x],  max_batch_size=20, fp16_mode=True,  log_level=trt.Logger.INFO, max_workspace_size=1<<29)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)
print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(8.3447e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(1.3758, device='cuda:0', grad_fn=<MaxBackward1>)

Just now, I found that if I use another network (MobileFaceNet), fp16_mode returns a reasonable result. Could it be a problem with the original network structure? I was using ShuffleNet v2 with SE blocks.
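(Not from this thread, but possibly relevant context: a common cause of near-zero or garbage FP16 engine output is an intermediate activation overflowing float16's dynamic range, whose maximum finite value is about 65504. SE-style blocks multiply feature maps by learned scales, which can push pre-activations past that limit even when FP32 inference looks fine. A minimal standalone sketch of the failure mode, plain NumPy rather than TensorRT:)

```python
import numpy as np

# float16 can only represent magnitudes up to ~65504.
fp16_max = np.finfo(np.float16).max  # 65504.0

x32 = np.float32(1e5)   # perfectly fine in float32
x16 = np.float16(x32)   # overflows to inf in float16
print(x32, x16)

# A pre-activation that is harmless in FP32 poisons everything
# downstream once the engine computes it in FP16.
act = np.array([3.0e4, 7.0e4], dtype=np.float32)
print(act.astype(np.float16))  # second entry overflows to inf
```

If that is what is happening, strict_type_constraints or keeping the offending layers in FP32 would be the usual workarounds.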

[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(2.6647e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(0.0007, device='cuda:0', grad_fn=<MaxBackward1>)

marigoold avatar Sep 01 '20 04:09 marigoold

Do you mind sharing the exact models (the mobilenet that succeeds and the one that fails)?

Also, I'm curious, what happens if you do

model_trt = torch2trt(..., strict_type_constraints=True)

jaybdub avatar Sep 02 '20 04:09 jaybdub

Thanks for your kind reply. I set strict_type_constraints=True and got these logs.

[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] WARNING: No implementation obeys reformatting-free rules, at least 15 reformatting nodes are needed, now picking the fastest path instead.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16:  tensor(8.3447e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16:  tensor(1.4315, device='cuda:0', grad_fn=<MaxBackward1>)

Here are my exact models: MobileFaceNet (the one that succeeded) and ShuffleNet v2 with SE blocks (the one that failed). They are modified from https://github.com/TreB1eN/InsightFace_Pytorch/blob/master/model.py and https://github.com/weiaicunzai/pytorch-cifar100/blob/master/models/shufflenetv2.py.

marigoold avatar Sep 02 '20 06:09 marigoold

Hi @jaybdub @marigoold Did you solve this problem? I have the same problem...

sdimantsd avatar Nov 16 '21 17:11 sdimantsd

I used torch2trt fp16 mode to convert two models (same architecture, different weights). One is good and the other is bad... could this be related to the model weights?
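(It could be: if one checkpoint produces much larger intermediate activations, it can overflow float16 (max ~65504) where the other checkpoint doesn't. One hypothetical way to check before converting is to record each layer's peak output magnitude in an FP32 forward pass; any layer approaching that limit is a likely culprit. A toy sketch of the idea with a made-up two-checkpoint MLP in NumPy, not the models from this thread:)

```python
import numpy as np

FP16_MAX = 65504.0  # largest finite float16 value

def layer_peaks(weights, x):
    """Run a toy linear+ReLU stack in float32 and report each layer's
    peak magnitude, flagging layers that get close to float16's limit."""
    report = []
    for i, w in enumerate(weights):
        x = np.maximum(w @ x, 0.0)          # linear + ReLU, float32
        peak = float(np.abs(x).max())
        report.append((i, peak, peak > 0.5 * FP16_MAX))
    return report

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)

# Two hypothetical checkpoints: same shapes, very different scales.
good = [rng.standard_normal((64, 64)).astype(np.float32) * 0.1 for _ in range(3)]
bad = [w * 100.0 for w in good]

print(layer_peaks(good, x))  # small peaks everywhere: FP16-safe
print(layer_peaks(bad, x))   # peaks grow layer by layer and get flagged
```

The same measurement on a real PyTorch model could be done with forward hooks; a flagged layer would explain why one set of weights survives FP16 and the other doesn't.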

itachi1232gg avatar Aug 21 '23 07:08 itachi1232gg