torch2trt
I got wrong output when fp16_mode is True
```python
import torch
import torch2trt as ttrt

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([1, 3, 112, 112]).cuda()
trt_net = ttrt.torch2trt(net, [x], max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True)

ori_output = net(imgs)  # imgs: a batch of real input images, defined elsewhere
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)

print('without fp16: ', (ori_output - trt_output).max())
print('with fp16: ', (ori_output - trt_fp16_output).max())
```
And I got different results. I checked trt_fp16_output and found that most of its values are zero. Is there anything wrong with my code? Looking forward to your reply, thanks!
I'm using torch==1.4.0 and python==3.7.0.
Hi @marigoold,
Thanks for reaching out. It looks like you're following the appropriate steps for conversion.
Do you mind running with

```python
import tensorrt as trt

trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True, log_level=trt.Logger.INFO)
```

and reporting the output log? This may help indicate any internal TensorRT issue.
Also, it may not be important here, but for your output comparisons I'd make sure to take the maximum of the absolute difference. There is a slight chance the output is actually off but nearly all the values of trt_output are greater than the original, in which case the signed maximum would hide the error. Probably not the case, but just to be safe:
```python
print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())
```
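As a toy illustration (my own numbers, not from your benchmark): if every converted value drifts above the original, every difference is negative and the signed maximum looks deceptively small.

```python
import torch

# Toy data: every "trt" value is larger than the original, so the signed
# maximum difference is negative and understates the error.
ori = torch.tensor([1.0, 2.0])
trt_out = torch.tensor([1.5, 2.5])

print((ori - trt_out).max())           # tensor(-0.5000) -- looks small
print(torch.abs(ori - trt_out).max())  # tensor(0.5000)  -- the real error magnitude
```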
Best, John
Thanks for your reply! I modified my code according to your suggestion.
```python
import torch
import tensorrt as trt
import torch2trt as ttrt

net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([20, 3, 112, 112]).cuda()
trt_net = ttrt.torch2trt(net, [x], max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True, log_level=trt.Logger.INFO)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)

print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())
```
And the log is:

```
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16: tensor(5.9605e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16: tensor(1.7316, device='cuda:0', grad_fn=<MaxBackward1>)
```
I set `max_workspace_size=1<<29`, and the INFO message about workspace memory disappeared, but the result was still wrong.
```python
net.load_state_dict(weight)
net = net.eval().cuda()
x = torch.ones([20, 3, 112, 112]).cuda()
trt_net = ttrt.torch2trt(net, [x], max_batch_size=20)
trt_net_fp16 = ttrt.torch2trt(net, [x], max_batch_size=20, fp16_mode=True, log_level=trt.Logger.INFO, max_workspace_size=1 << 29)

ori_output = net(imgs)
trt_output = trt_net(imgs)
trt_fp16_output = trt_net_fp16(imgs)

print('without fp16: ', torch.abs(ori_output - trt_output).max())
print('with fp16: ', torch.abs(ori_output - trt_fp16_output).max())
```
```
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16: tensor(8.3447e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16: tensor(1.3758, device='cuda:0', grad_fn=<MaxBackward1>)
```
Just now, I found that if I use another network (MobileFaceNet), fp16_mode returns a reasonable result. Could there be a problem with the original network structure? I was using ShuffleNet v2 with SE blocks. Here is the log for MobileFaceNet:

```
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16: tensor(2.6647e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16: tensor(0.0007, device='cuda:0', grad_fn=<MaxBackward1>)
```
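One way to probe the structure hypothesis (a sketch of mine, not part of the torch2trt API; it assumes the `net` and `x` from the snippets above) is to record each leaf module's peak output magnitude with forward hooks and compare it against the FP16 range. Activations that exceed the largest normal float16 value (65504) are a common reason an FP16 engine returns garbage or zeros.

```python
import torch

FP16_MAX = 65504.0  # largest normal float16 value
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Some modules return tuples; only record plain tensors.
        if torch.is_tensor(output):
            stats[name] = output.detach().abs().max().item()
    return hook

# Attach hooks to leaf modules only (modules with no children).
handles = [m.register_forward_hook(make_hook(n))
           for n, m in net.named_modules()
           if len(list(m.children())) == 0]

with torch.no_grad():
    net(x)

for h in handles:
    h.remove()

# Print layers from largest to smallest peak activation.
for name, peak in sorted(stats.items(), key=lambda kv: -kv[1]):
    flag = '  <-- exceeds FP16 range' if peak > FP16_MAX else ''
    print(f'{name}: {peak:.1f}{flag}')
```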
Do you mind sharing the exact models (the MobileFaceNet that succeeds and the one that fails)?
Also, I'm curious: what happens if you do

```python
model_trt = torch2trt(..., strict_type_constraints=True)
```
Thanks for your kind reply. I set `strict_type_constraints=True` and got these logs:
```
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] WARNING: No implementation obeys reformatting-free rules, at least 15 reformatting nodes are needed, now picking the fastest path instead.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
without fp16: tensor(8.3447e-07, device='cuda:0', grad_fn=<MaxBackward1>)
with fp16: tensor(1.4315, device='cuda:0', grad_fn=<MaxBackward1>)
```
Here are my exact models: MobileFaceNet (the one that succeeded) and ShuffleNet v2 with SE blocks (the one that failed). They are modified from https://github.com/TreB1eN/InsightFace_Pytorch/blob/master/model.py and https://github.com/weiaicunzai/pytorch-cifar100/blob/master/models/shufflenetv2.py respectively.
Hi @jaybdub @marigoold, did you solve this problem? I have the same problem...
I used torch2trt fp16 mode to convert two models (same architecture, different weights). One is good and the other is bad... could this be related to the model weights?
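If you suspect the weights, a quick check (a sketch, not from this thread; `good_net` and `bad_net` are hypothetical names for your two models) is to compare each model's peak weight magnitude against the FP16 range:

```python
import torch

FP16_MAX = 65504.0  # largest normal float16 value

def fp16_weight_report(model, label):
    # Peak absolute weight value across all parameters.
    peak = max(p.detach().abs().max().item() for p in model.parameters())
    note = ' <-- outside FP16 range!' if peak > FP16_MAX else ''
    print(f'{label}: max |weight| = {peak:.4g}{note}')

# Hypothetical handles for the two models with the same architecture.
fp16_weight_report(good_net, 'good model')
fp16_weight_report(bad_net, 'bad model')
```

That said, activations overflow FP16 far more often than weights do, so the hook-based activation check earlier in this thread is probably the more telling test.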