Bug Description

L1 loss is too large between torch f32 and compiled torch_tensorrt model:

(base) root@VM-121-213-centos:/apdcephfs/share_1041553/kyikiwang/BasketDetect# python resnet_trt.py
Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
Warm up ...
Start timing ...
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
0.0024523067
Iteration 10/10, ave batch time 196.93 ms
Input shape: torch.Size([32, 3, 224, 224])
Output features size: torch.Size([32, 1000])
Average batch time: 196.93 ms

To Reproduce

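import time

import numpy as np
import torch
import torch.backends.cudnn as cudnn
import torchvision
from PIL import Image                # used by predict() below
from torchvision import transforms   # used by rn50_preprocess() below

# Bypass torch.hub's forked-repo validation
torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

resnet50_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
resnet50_model.eval()

cudnn.benchmark = True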

def rn50_preprocess():
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    return preprocess

# decode the results into ([predicted class, description], probability)
def predict(img_path, model):
    img = Image.open(img_path)
    preprocess = rn50_preprocess()
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

    # move the input and model to GPU for speed if available
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model.to('cuda')

    with torch.no_grad():
        output = model(input_batch)
        # Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
        sm_output = torch.nn.functional.softmax(output[0], dim=0)

    ind = torch.argmax(sm_output)
    # NOTE: `d` is not defined in this snippet; presumably a class-index mapping loaded elsewhere
    return d[str(ind.item())], sm_output[ind]  # ([predicted class, description], probability)

def benchmark(model, trt_model, input_shape=(1024, 1, 224, 224), dtype='fp32', nwarmup=50, nruns=10000):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")

    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns+1):
            y1 = model(input_data)  # reference output from the PyTorch model
            start_time = time.time()
            features = trt_model(input_data)
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            # mean absolute (L1) difference between the PyTorch and compiled outputs
            print(np.mean(np.abs((y1-features).cpu().numpy())))
            if i%10==0:
                print('Iteration %d/%d, ave batch time %.2f ms'%(i, nruns, np.mean(timings)*1000))

    print("Input shape:", input_data.size())
    print("Output features size:", features.size())
    print('Average batch time: %.2f ms'%(np.mean(timings)*1000))
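
The loop above launches `model(input_data)` right before `start_time` without synchronizing, so on a GPU the timed window may also include part of the reference forward pass. A minimal sketch of a timing helper that drains pending work before starting the clock is given below. `time_callable` and its `sync` parameter are hypothetical names, not part of the tutorial; on CUDA, `sync` would be `torch.cuda.synchronize`.

```python
import time
from statistics import mean


def time_callable(fn, sync=None, nwarmup=3, nruns=10):
    """Return the mean wall-clock time of fn() in milliseconds.

    sync is an optional zero-argument callable that blocks until pending
    work has finished (for CUDA this would be torch.cuda.synchronize).
    """
    def barrier():
        if sync is not None:
            sync()

    # Warm-up runs are executed but not timed.
    for _ in range(nwarmup):
        fn()
    barrier()

    timings = []
    for _ in range(nruns):
        barrier()                      # drain earlier work before starting the clock
        start = time.perf_counter()
        fn()
        barrier()                      # wait for fn's own work to finish
        timings.append(time.perf_counter() - start)
    return mean(timings) * 1000.0


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs without a GPU.
    ms = time_callable(lambda: sum(range(10_000)))
    print(f"average time: {ms:.3f} ms")
```

With this shape, the reference model and the compiled model can each be timed separately, and the output comparison can be done once outside the timed loop.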

model = resnet50_model.eval().to("cuda")
#benchmark(model, input_shape=(128, 3, 224, 224), nruns=100)

import torch_tensorrt

# The compiled module will have precision as specified by "enabled_precisions".
# Here, it will have FP32 precision.
trt_model_fp32 = torch_tensorrt.compile(model,
    inputs = [torch_tensorrt.Input((128, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions = {torch.float32}, # Run with FP32
    workspace_size = 1 << 22
)

# Obtain the average time taken by a batch of input
#benchmark(trt_model_fp32, input_shape=(128, 3, 224, 224), nruns=100)

import torch_tensorrt

# The compiled module will have precision as specified by "enabled_precisions".
# Here, it will have FP16 precision.
trt_model_fp16 = torch_tensorrt.compile(model,
    inputs = [torch_tensorrt.Input((32, 3, 224, 224), dtype=torch.float)],
    enabled_precisions = {torch.half}, # Run with FP16
    workspace_size = 4 << 30,
    require_full_compilation=True
)

# Obtain the average time taken by a batch of input
benchmark(model, trt_model_fp16, input_shape=(32, 3, 224, 224), dtype='fp32', nruns=10)
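
Note that the call above compares the FP32 PyTorch model against `trt_model_fp16`, which was compiled with `enabled_precisions = {torch.half}`, so some difference from half-precision rounding would be expected. As a rough illustration only (it does not model how error accumulates across the layers of a network), the sketch below round-trips Python floats through IEEE 754 half precision using the standard `struct` module. `to_half` and `mean_abs_rounding_error` are hypothetical helper names, and the sample values are arbitrary.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mean_abs_rounding_error(values):
    """Mean absolute difference between each value and its half-precision rounding."""
    return sum(abs(v - to_half(v)) for v in values) / len(values)


if __name__ == "__main__":
    # Arbitrary sample values chosen for illustration.
    samples = [0.1, 1.5, 3.14159, 7.25, 12.3]
    for v in samples:
        print(f"{v!r} -> {to_half(v)!r}")
    print("mean abs rounding error:", mean_abs_rounding_error(samples))
```

To isolate the FP32-versus-FP32 comparison named in the issue title, the same benchmark call would need to be run against `trt_model_fp32` instead.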

Steps to reproduce the behavior:
1. docker pull nvcr.io/nvidia/pytorch:22.04-py3 (tags: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags)
2. docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network host -v /apdcephfs/share_1041553/kyikiwang:/apdcephfs/share_1041553/kyikiwang -it --name tensorrt2 torch_tensorrt:latest /bin/bash
3. Follow tutorial/resnet in this project.
4. python resnet.py

Expected behavior

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

Torch-TensorRT Version (e.g. 1.0.0):
PyTorch Version (e.g. 1.0):
CPU Architecture:
OS (e.g., Linux):
How you installed PyTorch (conda, pip, libtorch, source):
Build command you used (if compiling from source):
Are you using local sources or building from archives:
Python version:
CUDA version:
GPU models and configuration:
Any other relevant information:
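
The environment fields above were left blank in the report. A minimal sketch of collecting some of them from a running interpreter is shown below. It assumes the packages were installed under the distribution names `torch` and `torch-tensorrt` (an assumption; adjust to your install), and `installed_version` and `collect_environment` are hypothetical helpers.

```python
import platform
from importlib import metadata


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_environment() -> dict:
    """Gather a few of the fields requested by the issue template."""
    return {
        # Distribution names below are assumptions about how the packages were installed.
        "Torch-TensorRT Version": installed_version("torch-tensorrt"),
        "PyTorch Version": installed_version("torch"),
        "CPU Architecture": platform.machine(),
        "OS": platform.system(),
        "Python version": platform.python_version(),
    }


if __name__ == "__main__":
    for field, value in collect_environment().items():
        print(f"{field}: {value}")
```

The CUDA version and GPU model are not covered here; with PyTorch installed, `torch.version.cuda` and `torch.cuda.get_device_name(0)` report them.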

Additional context

May 14 '22 06:05 kyikiwang

@kyikiwang : Where are you comparing the predictions? I see latency benchmark comparisons in the code snippet you shared here. Can you please share the workflow you are using to reproduce the reported issue?

Jun 02 '22 19:06 andi4191
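
To address the question above about where predictions are compared, one option is to report a prediction-level metric alongside the raw L1 difference. The sketch below is framework-free so it can be run without a GPU. `l1_error`, `argmax`, and `top1_agreement` are hypothetical helpers, and the toy logits are made up for illustration. With PyTorch tensors, `output.tolist()` would produce the nested lists used here.

```python
def l1_error(ref, test):
    """Mean absolute difference over two equally shaped lists of rows."""
    total, count = 0.0, 0
    for ref_row, test_row in zip(ref, test):
        for r, t in zip(ref_row, test_row):
            total += abs(r - t)
            count += 1
    return total / count


def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(ref, test):
    """Fraction of rows whose predicted class (argmax) matches."""
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref, test))
    return matches / len(ref)


if __name__ == "__main__":
    # Toy logits for 2 samples and 3 classes; values are made up.
    ref_logits = [[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]]
    test_logits = [[1.998, 0.503, -1.001], [0.102, 0.299, 0.2]]
    print("L1 error:", l1_error(ref_logits, test_logits))
    print("top-1 agreement:", top1_agreement(ref_logits, test_logits))
```

A small L1 error together with full top-1 agreement would suggest numerical noise rather than an accuracy drop, though that would need to be checked on real inputs rather than random tensors.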

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

Sep 01 '22 00:09 github-actions[bot]


🐛 [Bug] High accuracy decrease between torch f32 model and torch_tensorrt f32 model
