TensorRT
🐛 [Bug] High accuracy decrease between torch f32 model and torch_tensorrt f32 model
Bug Description
The L1 loss between the torch f32 model and the compiled torch_tensorrt model is too large:
(base) root@VM-121-213-centos:/apdcephfs/share_1041553/kyikiwang/BasketDetect# python resnet_trt.py
Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
Warm up ...
Start timing ...
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
Iteration 10/10, ave batch time 196.93 ms
Input shape: torch.Size([32, 3, 224, 224])
Output features size: torch.Size([32, 1000])
Average batch time: 196.93 ms
To Reproduce
import torch
import torchvision
from torchvision import transforms  # used by rn50_preprocess
from PIL import Image  # used by predict

torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

resnet50_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
resnet50_model.eval()

import numpy as np
import time
import torch.backends.cudnn as cudnn
cudnn.benchmark = True

def rn50_preprocess():
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return preprocess

# decode the results into ([predicted class, description], probability)
# NOTE: `d` (the class-index mapping used below) is not defined anywhere in this snippet.
def predict(img_path, model):
    img = Image.open(img_path)
    preprocess = rn50_preprocess()
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

    # move the input and model to GPU for speed if available
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model.to('cuda')

    with torch.no_grad():
        output = model(input_batch)
        # Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
        sm_output = torch.nn.functional.softmax(output[0], dim=0)

    ind = torch.argmax(sm_output)
    return d[str(ind.item())], sm_output[ind]  # ([predicted class, description], probability)
def benchmark(model, trt_model, input_shape=(1024, 1, 224, 224), dtype='fp32', nwarmup=50, nruns=10000):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")

    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns+1):
            y1 = model(input_data)  # reference output from the PyTorch model
            start_time = time.time()
            features = trt_model(input_data)
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            # mean absolute difference between the PyTorch and Torch-TensorRT outputs
            print(np.mean(np.abs((y1-features).cpu().numpy())))
            if i%10==0:
                print('Iteration %d/%d, ave batch time %.2f ms'%(i, nruns, np.mean(timings)*1000))

    print("Input shape:", input_data.size())
    print("Output features size:", features.size())
    print('Average batch time: %.2f ms'%(np.mean(timings)*1000))
model = resnet50_model.eval().to("cuda")
#benchmark(model, input_shape=(128, 3, 224, 224), nruns=100)

import torch_tensorrt

# The compiled module will have precision as specified by "enabled_precisions".
# Here, it will have FP32 precision.
trt_model_fp32 = torch_tensorrt.compile(model,
    inputs = [torch_tensorrt.Input((128, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions = {torch.float32},  # Run with FP32
    workspace_size = 1 << 22
)

# Obtain the average time taken by a batch of input
#benchmark(trt_model_fp32, input_shape=(128, 3, 224, 224), nruns=100)

# The compiled module will have precision as specified by "enabled_precisions".
# Here, it will have FP16 precision.
trt_model_fp16 = torch_tensorrt.compile(model,
    inputs = [torch_tensorrt.Input((32, 3, 224, 224), dtype=torch.float)],
    enabled_precisions = {torch.half},  # Run with FP16
    workspace_size = 4 << 30,
    require_full_compilation = True
)

# Obtain the average time taken by a batch of input
benchmark(model, trt_model_fp16, input_shape=(32, 3, 224, 224), dtype='fp32', nruns=10)
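Note that the call above benchmarks `trt_model_fp16`, which is compiled with `enabled_precisions = {torch.half}`, so some deviation from the FP32 PyTorch output would be expected from rounding alone. As a rough illustration of the scale involved, the sketch below rounds individual values to IEEE 754 half precision using Python's `struct` module. It is not a model of what TensorRT does internally; the helper names `to_fp16` and `rounding_error` and the sample values are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs/unpacks a 16-bit half-precision float.
    return struct.unpack("e", struct.pack("e", x))[0]

def rounding_error(x: float) -> float:
    """Absolute error introduced by storing x in half precision."""
    return abs(to_fp16(x) - x)

# FP16 has 10 explicit mantissa bits, so neighbouring values in [1, 2)
# are spaced 2**-10 apart.
FP16_SPACING_NEAR_ONE = 2.0 ** -10

# Sample values are made up purely for illustration.
for value in (0.1, 1.2345, 7.0, 12.3456):
    print(f"{value:>8} -> fp16 {to_fp16(value)!r}, abs error {rounding_error(value):.2e}")
```

Per-value errors like these can compound across the layers of a network, so this gives only an intuition for scale, not a prediction of the difference reported in the log.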
Steps to reproduce the behavior:
1. docker pull nvcr.io/nvidia/pytorch:22.04-py3 (see https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags)
2. docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network host -v /apdcephfs/share_1041553/kyikiwang:/apdcephfs/share_1041553/kyikiwang -it --name tensorrt2 torch_tensorrt:latest /bin/bash
3. Follow tutorial/resnet in this project
4. python resnet.py
Expected behavior
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- Torch-TensorRT Version (e.g. 1.0.0):
- PyTorch Version (e.g. 1.0):
- CPU Architecture:
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, libtorch, source):
- Build command you used (if compiling from source):
- Are you using local sources or building from archives:
- Python version:
- CUDA version:
- GPU models and configuration:
- Any other relevant information:
Additional context
@kyikiwang : Where are you comparing the predictions? I see latency benchmark comparisons in the code snippet you shared here. Can you please share the workflow you are using to reproduce the reported issue?
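One way to answer this is to compare the two models' outputs directly instead of only timing them. The sketch below is a minimal, hypothetical helper (the name `compare_outputs` and the choice of metrics are assumptions, not part of Torch-TensorRT). It takes two batches of logits as nested lists, for example obtained with `tensor.float().cpu().tolist()`, and reports the mean and maximum absolute difference together with top-1 agreement.

```python
def compare_outputs(reference, candidate):
    """Compare two batches of logits given as lists of equal-length rows."""
    if len(reference) != len(candidate):
        raise ValueError("batch sizes differ")
    abs_diffs = []
    top1_matches = 0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("row lengths differ")
        abs_diffs.extend(abs(r - c) for r, c in zip(ref_row, cand_row))
        # Top-1 agreement: do both models pick the same class for this sample?
        ref_top = max(range(len(ref_row)), key=ref_row.__getitem__)
        cand_top = max(range(len(cand_row)), key=cand_row.__getitem__)
        top1_matches += ref_top == cand_top
    return {
        "mean_abs_diff": sum(abs_diffs) / len(abs_diffs),
        "max_abs_diff": max(abs_diffs),
        "top1_agreement": top1_matches / len(reference),
    }

# Toy data standing in for model outputs (made-up values).
ref = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
cand = [[0.1, 1.5, -1.0], [0.4, 0.5, 0.2]]
print(compare_outputs(ref, cand))
```

With the script in this issue, it could be called as `compare_outputs(y1.float().cpu().tolist(), features.float().cpu().tolist())`. For a classifier, top-1 agreement is often easier to interpret than a raw L1 distance between logits.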
This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.