Size of model and inference time are the same as FP32 after the calibration/quantization step.
Description
I was trying to follow along with these notebooks:
- https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb
- https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/finetune_quant_resnet50.ipynb
As I understand it, step 1 should result in a quantized INT8 model, so I would expect a model that is at least 2x smaller and 2x faster at inference. However, my model size and inference speed are both the same as FP32. When I print the model I can see that the Conv2d and ConvTranspose2d layers have been replaced by their quant variants.
Please help me understand this.
Environment
TensorRT Version:
NVIDIA GPU: RTX 2080 Max-Q
NVIDIA Driver Version: 515.65
CUDA Version: 11.7
CUDNN Version:
Operating System: Linux (NGC TensorRT docker)
Python Version (if applicable): 3.8.10
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): NGC TensorRT Docker + Torch-TensorRT installed from source
Relevant Files
Steps To Reproduce
Here is part of the code I used to quantize.
""" https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb
Quantizing followed by fine tune training """
import os, sys, socket
import torch import torch.utils.data from torch import nn from torch.utils.data import DataLoader, Dataset from pytorch_quantization import quant_modules from pytorch_quantization import nn as quant_nn from pytorch_quantization import calib from pytorch_quantization.tensor_quant import QuantDescriptor from settings import argo_settings as settings # for argoverse from models import model_bevdetnet as model_simple import numpy as np from tqdm import tqdm import torchvision import timeit import matplotlib.pyplot as plt
First, create a dataset:
class CalibDataset(Dataset):
    """ Dataset class """

    def __init__(self, path_calib_dataset_bev) -> None:
        """
        Default method
        """
        self.path_dataset_bev = path_calib_dataset_bev
        frames = os.listdir(self.path_dataset_bev)  # returns a list of all file names in the passed directory path
        print("Number of calib BEV images : ", len(frames))
        self.framelist = frames  # save it to an object attribute
        self.len = len(frames)  # this will be used by __len__()
        np.random.shuffle(self.framelist)  # randomly shuffle all frames, no need to shuffle at test time
        self.ERROR_LOGS_PRINT_ONCE = False

    def __len__(self):
        """
        Returns the length of the dataset / number of frames
        Returns: length of dataset
        """
        return self.len

    def __getitem__(self, item):
        """
        Does the actual data return to the calling iterator
        Args:
            item (): index of data item
        Returns: the train_X, and other train_Ys
        """
        frame_path = self.path_dataset_bev + '/' + self.framelist[item]  # form complete file path for the indexed frame
        file = np.load(frame_path)  # read in the npy file, 10 channels
        # sanity check: if dataset is not already in the desired shape, crop from top
        if file.shape != (settings.N_ROWS_RAW_BEV, settings.N_COLS_RAW_BEV, settings.N_CHANNELS_RAW_BEV):
            if self.ERROR_LOGS_PRINT_ONCE is False:
                print("WARNING : raw BEV shape : ", file.shape)
                print("Cropping from top to {0} rows and {1} cols...".format(settings.N_ROWS_TRAIN_BEV, settings.N_COLS_TRAIN_BEV))
            row_num_start = file.shape[0] - settings.N_ROWS_TRAIN_BEV
            col_num_start = 0
            col_num_end = settings.N_COLS_TRAIN_BEV
            file = file[row_num_start:, col_num_start: col_num_end, :]
            if self.ERROR_LOGS_PRINT_ONCE is False:
                print("Reshaped train frame -> ", file.shape)
                self.ERROR_LOGS_PRINT_ONCE = True
        # ------------- NOTE: Since 31/07/22, the channel order in BEV is -------------#
        # 0, 1, 2, 3, 4, 5 -> Z, D, I, X, Y, ring/laser_number
        # ------------------------------------------------------------------------------#
        # extract all channels
        ch_Z = file[:, :, 0]
        ch_D = file[:, :, 1]
        ch_I = file[:, :, 2]
        ch_X = file[:, :, 3]
        ch_Y = file[:, :, 4]
        ch_ring = file[:, :, 5]
        if settings.ADD_BINARY_MASK is True:
            train_x_normalized = np.zeros((settings.N_ROWS_TRAIN_BEV,
                                           settings.N_COLS_TRAIN_BEV,
                                           len(settings.TRAIN_BEV_CHANNEL_NAMES) + 1))
        else:
            train_x_normalized = np.zeros((settings.N_ROWS_TRAIN_BEV,
                                           settings.N_COLS_TRAIN_BEV,
                                           len(settings.TRAIN_BEV_CHANNEL_NAMES)))
        for ch_index, train_feature_name in enumerate(settings.TRAIN_BEV_CHANNEL_NAMES):
            if train_feature_name not in settings.channel_assignment_dict.keys():
                print("ERROR: Selected BEV feature {0}, not found in {1}".format(train_feature_name, settings.channel_assignment_dict))
                exit(-1)
            else:
                if train_feature_name == 'Z':
                    train_x_normalized[:, :, ch_index] = ch_Z / 7.
                if train_feature_name == 'D':
                    train_x_normalized[:, :, ch_index] = ch_D / np.max(ch_D)
                if train_feature_name == 'I':
                    train_x_normalized[:, :, ch_index] = ch_I / np.max(ch_I)
                if train_feature_name == 'X':
                    train_x_normalized[:, :, ch_index] = ch_X / np.max(ch_X)
                if train_feature_name == 'Y':
                    train_x_normalized[:, :, ch_index] = ch_Y / np.max(ch_Y)
                if train_feature_name == 'RING':
                    train_x_normalized[:, :, ch_index] = ch_ring / 32
        if settings.ADD_BINARY_MASK is True:
            mask_r, mask_c = np.where(train_x_normalized[:, :, 0] != 0)  # get pixels that have value
            train_x_normalized[mask_r, mask_c, -1] = 1.  # create binary mask
        return torch.from_numpy(train_x_normalized).float().permute(2, 0, 1)
def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistics"""
    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):
    #     model(image.cuda())
    #     if i >= num_batches:
    #         break
    for i, image in tqdm(enumerate(data_loader), total=num_batches):
        model(image.cuda())
        if i >= num_batches:
            break

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()
def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            print(F"{name:40}: {module}")
    model.cuda()
def quantize_to_int8(fp32_pth_model_path, quantized_model_save_path):
    """Calibrate the FP32 model and save the calibrated (fake-quant) state dict."""
    # set default quant descriptor
    quant_desc_input = QuantDescriptor(calib_method='histogram')
    quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
    quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
    quant_nn.QuantConvTranspose2d.set_default_quant_desc_input(quant_desc_input)
    quant_nn.QuantAvgPool2d.set_default_quant_desc_input(quant_desc_input)

    # init quantized modules
    quant_modules.initialize()

    # load pretrained network
    network = model_simple.BevDetNetSimple(in_channels=settings.N_CHANNELS_TRAIN_BEV + 1,
                                           out_kp_channels=settings.N_CHANNELS_PREDICTION_KP,
                                           scale_H=2, scale_W=2, predict_3d_center=True).cuda()
    if fp32_pth_model_path is not None:
        network.load_state_dict(torch.load(fp32_pth_model_path, map_location="cuda:0"))
    # print(network)

    # create data loader
    calib_dataset = CalibDataset(path_calib_dataset_bev=settings.WORKSPACE_ROOT_PATH + 'calib_data/')
    calib_loader = DataLoader(dataset=calib_dataset, batch_size=2, num_workers=1)

    with torch.no_grad():
        collect_stats(network, calib_loader, num_batches=6)
        compute_amax(network, method="percentile", percentile=99.99)

    torch.save(network.state_dict(), quantized_model_save_path)
def test_calibrated_inference(path_model_int8):
    """Runs inference with the INT8 model on some test data."""
    quant_modules.initialize()
    # trt_ts_module = torch.jit.load(path_model_int8).eval()
    network = model_simple.BevDetNetSimple(in_channels=settings.N_CHANNELS_TRAIN_BEV + 1,
                                           out_kp_channels=settings.N_CHANNELS_PREDICTION_KP,
                                           scale_H=2, scale_W=2, predict_3d_center=True).cuda()
    if path_model_int8 is not None:
        network.load_state_dict(torch.load(path_model_int8, map_location="cuda:0"))

    calib_dataset = CalibDataset(path_calib_dataset_bev=settings.WORKSPACE_ROOT_PATH + 'calib_data/')
    calib_loader = DataLoader(dataset=calib_dataset, batch_size=1, num_workers=1)

    with torch.no_grad():
        for i, data in enumerate(calib_loader):
            bev_npy = data.permute(0, 2, 3, 1)[0].numpy()
            bev = data.cuda()
            t1 = timeit.default_timer()
            kp, hwl, rot, dxdy = network(bev)
            t2 = timeit.default_timer()
            print("int8 inference ms --> ", (t2 - t1) * 1000)
            kp_mask = torch.argmax(torch.softmax(kp, 1), 1).cpu().numpy()[0]
            print(kp_mask.shape)
            r, c = np.where(kp_mask == 1)
            bev_npy[r, c, 0] = 1
            image_path = settings.images_path + "anno_bev_int8_" + str(i) + '.png'
            plt.imsave(image_path, bev_npy[:, :, 0:3])


if __name__ == '__main__':
    pth_model_path = settings.WORKSPACE_ROOT_PATH + 'trained_models/pth/argo_det_020922_zdirb_416x416_50mx2_e_80.pth'
    int8_model_path = settings.WORKSPACE_ROOT_PATH + 'trained_models/trt_ts/quant_model.pth'
    # quantize_to_int8(pth_model_path, int8_model_path)
    test_calibrated_inference(int8_model_path)
As I understand it, step 1 should result in a quantized INT8 model, so I would expect a model that is at least 2x smaller and 2x faster at inference. However, my model size and inference speed are both the same as FP32.
The QAT model is almost the same size as the original FP32 model because it still contains the FP32 weights. About the inference time, can you share your QAT ONNX here?
@ttyio for viz
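To illustrate this point, here is a small, purely illustrative check on the calibrated checkpoint saved by quantize_to_int8() above (the file name is a placeholder): the conv/linear weights stay float32, and calibration only adds small per-quantizer amax range tensors, so the file size barely changes.

# illustrative only: inspect the calibrated checkpoint saved by quantize_to_int8()
import torch

state = torch.load("quant_model.pth", map_location="cpu")  # placeholder path
for key, tensor in state.items():
    # weights are still torch.float32; calibration only adds "_amax" range tensors
    if key.endswith("_amax") or key.endswith("weight"):
        print(f"{key:70} {tuple(tensor.shape)} {tensor.dtype}")

The INT8 size and speed benefit only appears once TensorRT builds an engine from this fake-quant model.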
Hi,
Thanks for your quick response. Please find attached the file after step 1.
Best Regards, Sambit
argodet_qat.jit.pt: https://drive.google.com/file/d/1lt1UGPY_hS3R23MV2ziq-dbrgb2Ic-Gj/view?usp=drive_web
Here is the original pth file. argo_det_020922_zdirb_416x416_50mx2_e_80.pth https://drive.google.com/file/d/1mgdngVrs_BC50N-IFF9Tp32mv5adDaRU/view?usp=drive_web
No access. Have you exported the quantized model to ONNX and run inference with TensorRT?
No, I use Torch-TensorRT and TorchScript. ONNX export is not needed in this case, is it?
I shall check the permission issue.
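For clarity, the path I mean is roughly the following (a rough sketch; the scripted module name and input shape are placeholders, not my exact pipeline):

import torch
import torch_tensorrt

# load the scripted QAT model (placeholder file name)
model = torch.jit.load("quant_model_scripted.jit.pt").eval().cuda()

# compile the TorchScript module with Torch-TensorRT, requesting INT8 kernels
trt_ts_module = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 7, 416, 416))],  # placeholder shape
    enabled_precisions={torch.int8},
)
torch.jit.save(trt_ts_module, "trt_ts_module.ts")  # then loaded from C++ for deployment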
No, I use Torch-TensorRT and TorchScript. ONNX export is not needed in this case, is it?
I believe you need to export to ONNX and use TRT's ONNX parser to get the best performance. @ttyio Correct me if I'm wrong
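For example, the export step described in the pytorch-quantization docs looks roughly like this (a sketch based on the reproduction code above; the checkpoint name, input resolution, and output names are assumptions):

# sketch: export the calibrated model with Q/DQ (fake-quant) nodes to ONNX
import torch
from pytorch_quantization import quant_modules, nn as quant_nn
from settings import argo_settings as settings
from models import model_bevdetnet as model_simple

quant_modules.initialize()
network = model_simple.BevDetNetSimple(in_channels=settings.N_CHANNELS_TRAIN_BEV + 1,
                                       out_kp_channels=settings.N_CHANNELS_PREDICTION_KP,
                                       scale_H=2, scale_W=2, predict_3d_center=True).cuda()
network.load_state_dict(torch.load("quant_model.pth", map_location="cuda:0"))  # calibrated checkpoint
network.eval()

quant_nn.TensorQuantizer.use_fb_fake_quant = True  # emit QuantizeLinear/DequantizeLinear ops
dummy = torch.randn(1, settings.N_CHANNELS_TRAIN_BEV + 1, 416, 416, device="cuda")  # 416x416 assumed
torch.onnx.export(network, dummy, "quant_model.onnx", opset_version=13,
                  input_names=["bev"], output_names=["kp", "hwl", "rot", "dxdy"])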
Okay, this is new information. I tend to agree with you regarding performance: Torch-TensorRT C++ performance seems not as good as other libraries like torch2trt. I haven't tried Torch-TensorRT Python since my final deployment is in C++. In any case, could you please share a mid-complexity example of this flow: PyTorch -> ONNX -> TensorRT C++? Something like U-Net for semantic segmentation.
I know there is an MNIST example, but I think it is too trivial to truly understand the details of working with TensorRT C++.
Please let me know.
https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html#export-to-onnx
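Once the Q/DQ ONNX exists, an INT8 engine can be built and timed with trtexec, which is where the size and latency reduction actually shows up. For example (file names are placeholders):

trtexec --onnx=quant_model.onnx --int8 --saveEngine=quant_model_int8.engine

trtexec prints the measured per-inference latency, and the saved engine file reflects the INT8 weights.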
Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!