
🐛 [Bug] High accuracy decrease between torch f32 model and torch_tensorrt f32 model

Open kyikiwang opened this issue 3 years ago • 1 comments

Bug Description

L1 loss is too large between torch f32 and compiled torch_tensorrt model.

    (base) root@VM-121-213-centos:/apdcephfs/share_1041553/kyikiwang/BasketDetect# python resnet_trt.py
    Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0
    WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
    WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
    Warm up ...
    Start timing ...
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    0.0024523067
    Iteration 10/10, ave batch time 196.93 ms
    Input shape: torch.Size([32, 3, 224, 224])
    Output features size: torch.Size([32, 1000])
    Average batch time: 196.93 ms
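A single mean-absolute-error value such as the 0.0024523067 printed above is hard to judge without knowing the scale of the outputs. The following is a minimal sketch, assuming both outputs have already been converted to NumPy arrays (for example with `.cpu().numpy()`); the helper name `compare_outputs` and its result keys are hypothetical and not part of Torch-TensorRT.

```python
import numpy as np


def compare_outputs(reference, candidate):
    """Summarize the numerical difference between two output arrays.

    Hypothetical helper: both arguments are array-likes of the same shape,
    e.g. logits of shape (batch, num_classes).
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")

    abs_err = np.abs(ref - cand)
    return {
        "mean_abs_err": float(abs_err.mean()),
        "max_abs_err": float(abs_err.max()),
        # Mean error relative to the mean magnitude of the reference output.
        "rel_mean_abs_err": float(abs_err.mean() / (np.abs(ref).mean() + 1e-12)),
    }


if __name__ == "__main__":
    # Synthetic stand-in data; real use would pass the two models' outputs.
    rng = np.random.default_rng(0)
    ref = rng.standard_normal((4, 1000))
    cand = ref + 1e-3 * rng.standard_normal((4, 1000))
    print(compare_outputs(ref, cand))
```

Reporting the relative error next to the absolute one makes it easier to tell whether a given difference is large compared with the magnitude of the logits.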

To Reproduce

    import time

    import numpy as np
    import torch
    import torch.backends.cudnn as cudnn
    import torchvision
    from PIL import Image
    from torchvision import transforms

    torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

    resnet50_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
    resnet50_model.eval()

    cudnn.benchmark = True

    def rn50_preprocess():
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])
        return preprocess

    # decode the results into ([predicted class, description], probability)
    def predict(img_path, model):
        img = Image.open(img_path)
        preprocess = rn50_preprocess()
        input_tensor = preprocess(img)
        input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

        # move the input and model to GPU for speed if available
        if torch.cuda.is_available():
            input_batch = input_batch.to('cuda')
            model.to('cuda')

        with torch.no_grad():
            output = model(input_batch)
            # Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
            sm_output = torch.nn.functional.softmax(output[0], dim=0)

        ind = torch.argmax(sm_output)
        # NOTE: `d` (the class-index lookup) is not defined anywhere in this snippet
        return d[str(ind.item())], sm_output[ind]  # ([predicted class, description], probability)

    def benchmark(model, trt_model, input_shape=(1024, 1, 224, 224), dtype='fp32', nwarmup=50, nruns=10000):
        input_data = torch.randn(input_shape)
        input_data = input_data.to("cuda")

        print("Warm up ...")
        with torch.no_grad():
            for _ in range(nwarmup):
                features = model(input_data)
        torch.cuda.synchronize()
        print("Start timing ...")
        timings = []
        with torch.no_grad():
            for i in range(1, nruns+1):
                y1 = model(input_data)  # reference output from the PyTorch model
                start_time = time.time()
                features = trt_model(input_data)
                torch.cuda.synchronize()
                end_time = time.time()
                timings.append(end_time - start_time)
                # mean absolute (L1) difference between the two outputs
                print(np.mean(np.abs((y1-features).cpu().numpy())))
                if i%10==0:
                    print('Iteration %d/%d, ave batch time %.2f ms'%(i, nruns, np.mean(timings)*1000))

        print("Input shape:", input_data.size())
        print("Output features size:", features.size())
        print('Average batch time: %.2f ms'%(np.mean(timings)*1000))
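In the `benchmark` function above, only `model` is warmed up, and the reference forward pass is launched right before `start_time` without a synchronization, so the measured time may include work queued by the reference model. A minimal sketch of a timing loop that keeps timing separate from the accuracy comparison is shown here; `time_callable` and its `sync` parameter are hypothetical names, and for CUDA models one would pass `torch.cuda.synchronize` as `sync`.

```python
import time
from statistics import mean


def time_callable(fn, sync=lambda: None, nwarmup=5, nruns=20):
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` is called before each clock read so that pending asynchronous
    work is finished before timing starts and before it stops.
    """
    # Warm up the exact callable that will be timed.
    for _ in range(nwarmup):
        fn()
    sync()

    timings = []
    for _ in range(nruns):
        sync()  # make sure nothing from earlier calls is still running
        start = time.perf_counter()
        fn()
        sync()  # wait for this call's work to complete
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs without a GPU.
    seconds = time_callable(lambda: sum(range(10_000)))
    print("Average call time: %.3f ms" % (seconds * 1000))
```

With this split, the reference output can be computed once outside the timed region and compared against the compiled model's output separately.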

    model = resnet50_model.eval().to("cuda")
    #benchmark(model, input_shape=(128, 3, 224, 224), nruns=100)

    import torch_tensorrt

    # The compiled module will have precision as specified by "enabled_precisions".
    # Here, it will have FP32 precision.
    trt_model_fp32 = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((128, 3, 224, 224), dtype=torch.float32)],
        enabled_precisions={torch.float32},  # Run with FP32
        workspace_size=1 << 22,
    )

    # Obtain the average time taken by a batch of input
    #benchmark(trt_model_fp32, input_shape=(128, 3, 224, 224), nruns=100)

    # The compiled module will have precision as specified by "enabled_precisions".
    # Here, it will have FP16 precision.
    trt_model_fp16 = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((32, 3, 224, 224), dtype=torch.float)],
        enabled_precisions={torch.half},  # Run with FP16
        workspace_size=4 << 30,
        require_full_compilation=True,
    )

    # Obtain the average time taken by a batch of input
    benchmark(model, trt_model_fp16, input_shape=(32, 3, 224, 224), dtype='fp32', nruns=10)
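The module passed to `benchmark` above is `trt_model_fp16`, which is compiled with `torch.half` enabled, so TensorRT is allowed to select FP16 kernels. Whether that accounts for the reported difference is not established in this thread. As background only, the sketch below measures the rounding error introduced by a single FP32 to FP16 round trip in NumPy; `fp16_roundtrip_error` is a hypothetical helper, and it says nothing about how errors accumulate through a real network.

```python
import numpy as np


def fp16_roundtrip_error(values):
    """Return (mean absolute error, max relative error) of FP32 -> FP16 -> FP32."""
    x32 = np.asarray(values, dtype=np.float32)
    x16 = x32.astype(np.float16).astype(np.float32)
    abs_err = np.abs(x16 - x32)
    # Guard against division by zero for exact zeros in the input.
    rel_err = abs_err / np.maximum(np.abs(x32), np.finfo(np.float32).tiny)
    return float(abs_err.mean()), float(rel_err.max())


if __name__ == "__main__":
    # Values in [1, 2) are in FP16's normal range, where the relative
    # rounding error is bounded by 2**-11.
    samples = np.linspace(1.0, 2.0, 1001, endpoint=False)
    mean_abs, max_rel = fp16_roundtrip_error(samples)
    print("mean abs error:", mean_abs)
    print("max rel error:", max_rel, "bound:", 2.0 ** -11)
```

Comparing the FP32 PyTorch model against `trt_model_fp32` instead would show whether the difference persists when only FP32 is enabled.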

Steps to reproduce the behavior:

  • docker pull nvcr.io/nvidia/pytorch:22.04-py3 (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags)
  • docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network host -v /apdcephfs/share_1041553/kyikiwang:/apdcephfs/share_1041553/kyikiwang -it --name tensorrt2 torch_tensorrt:latest /bin/bash
  • Follow tutorial/resnet in this project
  • python resnet.py

Expected behavior

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0):
  • PyTorch Version (e.g. 1.0):
  • CPU Architecture:
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, libtorch, source):
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version:
  • GPU models and configuration:
  • Any other relevant information:
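The environment fields above were left empty in the report. One way to gather most of them is sketched below; `collect_environment` is a hypothetical helper, and it assumes each listed module exposes a `__version__` attribute, as `torch` and `torch_tensorrt` do.

```python
import importlib
import platform


def collect_environment(modules=("torch", "torch_tensorrt")):
    """Collect version strings for the environment section of a bug report."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "cpu_arch": platform.machine(),
    }
    for name in modules:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # missing package or a failing native import
            info[name] = f"unavailable ({type(exc).__name__})"
            continue
        info[name] = getattr(module, "__version__", "unknown")
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

When PyTorch is installed, the CUDA version it was built against can additionally be read from `torch.version.cuda`.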

Additional context

kyikiwang, May 14 '22 06:05

@kyikiwang: Where are you comparing the predictions? I see latency benchmark comparisons in the code snippet you shared here. Can you please share the workflow you are using to reproduce the reported issue?
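One way to compare predictions rather than latency is to check how often the two models select the same class. This is a minimal sketch, assuming the logits of both models are available as `(batch, num_classes)` NumPy arrays; `top1_agreement` is a hypothetical helper and not part of Torch-TensorRT.

```python
import numpy as np


def top1_agreement(reference_logits, candidate_logits):
    """Fraction of rows where both models pick the same argmax class."""
    ref = np.asarray(reference_logits)
    cand = np.asarray(candidate_logits)
    if ref.shape != cand.shape or ref.ndim != 2:
        raise ValueError("expected two (batch, num_classes) arrays of equal shape")
    return float(np.mean(ref.argmax(axis=1) == cand.argmax(axis=1)))


if __name__ == "__main__":
    # Toy logits for three samples and two classes.
    ref = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
    cand = [[0.2, 0.8], [0.4, 0.6], [0.1, 0.9]]
    print("top-1 agreement:", top1_agreement(ref, cand))
```

Running such a check on real, preprocessed images is likely more informative than on the `torch.randn` inputs used in the benchmark.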

andi4191, Jun 02 '22 19:06

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot], Sep 01 '22 00:09