
Size of model and inference time are the same as FP32 after the calibration/quantization step.

SM1991CODES opened this issue 2 years ago · 8 comments

Description

I was trying to follow along with these notebooks:

  1. https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb

  2. https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/finetune_quant_resnet50.ipynb

As I understand it, step 1 should result in a quantized INT8 model, so I would expect a model that is at least 2x smaller and 2x faster at inference. However, my model size and inference speed are both the same as FP32. I printed out the model and can see that the Conv2d and ConvTranspose2d layers are replaced by their quant variants.

Please help me understand this.

Environment

TensorRT Version:
NVIDIA GPU: RTX 2080 Max-Q
NVIDIA Driver Version: 515.65
CUDA Version: 11.7
CUDNN Version:
Operating System: Linux (NGC TensorRT docker)
Python Version (if applicable): 3.8.10
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): NGC TensorRT Docker + Torch-TensorRT installed from source

Relevant Files

Steps To Reproduce

Here is part of the code I used to quantize.

""" https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb

Quantizing followed by fine tune training """

import os, sys, socket

import torch import torch.utils.data from torch import nn from torch.utils.data import DataLoader, Dataset from pytorch_quantization import quant_modules from pytorch_quantization import nn as quant_nn from pytorch_quantization import calib from pytorch_quantization.tensor_quant import QuantDescriptor from settings import argo_settings as settings # for argoverse from models import model_bevdetnet as model_simple import numpy as np from tqdm import tqdm import torchvision import timeit import matplotlib.pyplot as plt

first create a dataset

class CalibDataset(Dataset):
    """ Dataset class """

    def __init__(self, path_calib_dataset_bev) -> None:
        """
        Default method
        """

        self.path_dataset_bev = path_calib_dataset_bev
        frames = os.listdir(self.path_dataset_bev)  # returns a list of all file names in the passed directory path
        print("Number of calib BEV images : ", len(frames))
        self.framelist = frames  # save it to an object parameter
        self.len = len(frames)  # this will be used by __len__()
        np.random.shuffle(self.framelist)  # randomly shuffle all frames, no need to shuffle at test time
        self.ERROR_LOGS_PRINT_ONCE = False

    def __len__(self):
        """
        Returns the length of the dataset / number of frames
        Returns: length of dataset
        """

        return self.len

    def __getitem__(self, item):
        """
        Function does the actual data return to the calling iterator
        Args:
            item (): index of data item
        Returns: the train_X, and other train_Ys
        """

        frame_path = self.path_dataset_bev + '/' + self.framelist[item]  # form complete file path for the indexed frame
        file = np.load(frame_path)  # read in the npy file, 10 channels

        # sanity check, if dataset not already in desired shape, crop from top
        if file.shape != (settings.N_ROWS_RAW_BEV, settings.N_COLS_RAW_BEV, settings.N_CHANNELS_RAW_BEV):
            if self.ERROR_LOGS_PRINT_ONCE is False:
                print("WARNING : raw BEV shape : ", file.shape)
                print("Cropping from top to {0} rows and {1} cols...".format(settings.N_ROWS_TRAIN_BEV, settings.N_COLS_TRAIN_BEV))

            row_num_start = file.shape[0] - settings.N_ROWS_TRAIN_BEV
            col_num_start = 0
            col_num_end = settings.N_COLS_TRAIN_BEV

            file = file[row_num_start:, col_num_start: col_num_end, :]

            if self.ERROR_LOGS_PRINT_ONCE is False:
                print("Reshaped train frame -> ", file.shape)
                self.ERROR_LOGS_PRINT_ONCE = True

        # ----------------- NOTE: Since 31/07/22, channel order in BEV is -----------------#
        # 0, 1, 2, 3, 4, 5 -> Z, D, I, X, Y, ring/laser_number
        # ----------------------------------------------------------------------------------#

        # extract all channels
        ch_Z = file[:, :, 0]
        ch_D = file[:, :, 1]
        ch_I = file[:, :, 2]
        ch_X = file[:, :, 3]
        ch_Y = file[:, :, 4]
        ch_ring = file[:, :, 5]

        if settings.ADD_BINARY_MASK is True:
            train_x_normalized = np.zeros((settings.N_ROWS_TRAIN_BEV,
                                           settings.N_COLS_TRAIN_BEV,
                                           len(settings.TRAIN_BEV_CHANNEL_NAMES) + 1))
        else:
            train_x_normalized = np.zeros((settings.N_ROWS_TRAIN_BEV,
                                           settings.N_COLS_TRAIN_BEV,
                                           len(settings.TRAIN_BEV_CHANNEL_NAMES)))

        for ch_index, train_feature_name in enumerate(settings.TRAIN_BEV_CHANNEL_NAMES):
            if train_feature_name not in settings.channel_assignment_dict.keys():
                print("ERROR: Selected BEV feature {0}, not found in {1}".format(train_feature_name, settings.channel_assignment_dict))
                exit(-1)
            else:
                if train_feature_name == 'Z':
                    train_x_normalized[:, :, ch_index] = ch_Z / 7.
                if train_feature_name == 'D':
                    train_x_normalized[:, :, ch_index] = ch_D / np.max(ch_D)
                if train_feature_name == 'I':
                    train_x_normalized[:, :, ch_index] = ch_I / np.max(ch_I)
                if train_feature_name == 'X':
                    train_x_normalized[:, :, ch_index] = ch_X / np.max(ch_X)
                if train_feature_name == 'Y':
                    train_x_normalized[:, :, ch_index] = ch_Y / np.max(ch_Y)
                if train_feature_name == 'RING':
                    train_x_normalized[:, :, ch_index] = ch_ring / 32

        if settings.ADD_BINARY_MASK is True:
            mask_r, mask_c = np.where(train_x_normalized[:, :, 0] != 0)  # get pixels that have value
            train_x_normalized[mask_r, mask_c, -1] = 1.  # create binary mask

        return torch.from_numpy(train_x_normalized).float().permute(2, 0, 1)

def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistics"""

    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):
    #     model(image.cuda())
    #     if i >= num_batches:
    #         break
    for i, image in tqdm(enumerate(data_loader), total=num_batches):
        model(image.cuda())
        if i >= num_batches:
            break

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            print(F"{name:40}: {module}")
    model.cuda()

def quantize_to_int8(fp32_pth_model_path, quantized_model_save_path):
    """ """

    # set default quant descriptor
    quant_desc_input = QuantDescriptor(calib_method='histogram')
    quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
    quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
    quant_nn.QuantConvTranspose2d.set_default_quant_desc_input(quant_desc_input)
    quant_nn.QuantAvgPool2d.set_default_quant_desc_input(quant_desc_input)

    # init quantized modules
    quant_modules.initialize()

    # load pretrained network
    network = model_simple.BevDetNetSimple(in_channels=settings.N_CHANNELS_TRAIN_BEV + 1,
                                           out_kp_channels=settings.N_CHANNELS_PREDICTION_KP,
                                           scale_H=2, scale_W=2, predict_3d_center=True).cuda()
    if fp32_pth_model_path is not None:
        network.load_state_dict(torch.load(fp32_pth_model_path, map_location="cuda:0"))

    # print(network)

    # create data loader
    calib_dataset = CalibDataset(path_calib_dataset_bev=settings.WORKSPACE_ROOT_PATH + 'calib_data/')
    calib_loader = DataLoader(dataset=calib_dataset, batch_size=2, num_workers=1)

    with torch.no_grad():
        collect_stats(network, calib_loader, num_batches=6)
        compute_amax(network, method="percentile", percentile=99.99)

    torch.save(network.state_dict(), quantized_model_save_path)

def test_calibrated_inference(path_model_int8):
    """ runs inference with the int8 model on some test data """

    quant_modules.initialize()

    # trt_ts_module = torch.jit.load(path_model_int8).eval()
    network = model_simple.BevDetNetSimple(in_channels=settings.N_CHANNELS_TRAIN_BEV + 1,
                                           out_kp_channels=settings.N_CHANNELS_PREDICTION_KP,
                                           scale_H=2, scale_W=2, predict_3d_center=True).cuda()
    if path_model_int8 is not None:
        network.load_state_dict(torch.load(path_model_int8, map_location="cuda:0"))

    calib_dataset = CalibDataset(path_calib_dataset_bev=settings.WORKSPACE_ROOT_PATH + 'calib_data/')
    calib_loader = DataLoader(dataset=calib_dataset, batch_size=1, num_workers=1)

    with torch.no_grad():
        for i, data in enumerate(calib_loader):
            bev_npy = data.permute(0, 2, 3, 1)[0].numpy()
            bev = data.cuda()

            t1 = timeit.default_timer()
            kp, hwl, rot, dxdy = network(bev)
            t2 = timeit.default_timer()
            print("int8 inference ms --> ", (t2 - t1) * 1000)
            kp_mask = torch.argmax(torch.softmax(kp, 1), 1).cpu().numpy()[0]
            print(kp_mask.shape)

            r, c = np.where(kp_mask == 1)
            bev_npy[r, c, 0] = 1
            image_path = settings.images_path + "anno_bev_int8_" + str(i) + '.png'
            plt.imsave(image_path, bev_npy[:, :, 0:3])

if __name__ == '__main__':

    pth_model_path = settings.WORKSPACE_ROOT_PATH + 'trained_models/pth/argo_det_020922_zdirb_416x416_50mx2_e_80.pth'
    int8_model_path = settings.WORKSPACE_ROOT_PATH + 'trained_models/trt_ts/quant_model.pth'
    # quantize_to_int8(pth_model_path, int8_model_path)

    test_calibrated_inference(int8_model_path)

SM1991CODES · Sep 12 '22 16:09

As I understand it, step 1 should result in a quantized INT8 model, so I would expect a model that is at least 2x smaller and 2x faster at inference. However, my model size and inference speed are both the same as FP32.

The QAT model is almost the same size as the original FP32 model because it still contains the FP32 weights. About the inference time, can you share your QAT ONNX here?

@ttyio for viz
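
For reference, a quick way to see this in the checkpoint itself (a minimal sketch, assuming the quant_model.pth path from the script above): calibration only adds small per-quantizer scale tensors (the _amax buffers) on top of the FP32 weights, which is why the file size barely changes.

import torch

# Minimal sketch: inspect the checkpoint saved by quantize_to_int8() above (path assumed).
sd = torch.load("trained_models/trt_ts/quant_model.pth", map_location="cpu")

fp32_elems = sum(v.numel() for v in sd.values() if torch.is_tensor(v) and v.dtype == torch.float32)
amax_keys = [k for k in sd if k.endswith("_amax")]

print("FP32 elements stored:", fp32_elems)            # essentially the full FP32 weight set
print("calibration scale entries (_amax):", len(amax_keys))
print("examples:", amax_keys[:3])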

zerollzeng · Sep 13 '22 10:09

Hi,

Thanks for your quick response. Please find attached the file after step 1.

Best Regards,
Sambit

argodet_qat.jit.pt: https://drive.google.com/file/d/1lt1UGPY_hS3R23MV2ziq-dbrgb2Ic-Gj/view?usp=drive_web


SM1991CODES · Sep 13 '22 11:09

Here is the original pth file: argo_det_020922_zdirb_416x416_50mx2_e_80.pth (https://drive.google.com/file/d/1mgdngVrs_BC50N-IFF9Tp32mv5adDaRU/view?usp=drive_web)


SM1991CODES · Sep 13 '22 11:09

No access. Have you exported the quantized model to ONNX and run inference with TensorRT?

zerollzeng · Sep 14 '22 05:09

No, I use Torch-TensorRT and TorchScript. ONNX export is not needed in this case, is it?

I shall check the permission issue.


SM1991CODES · Sep 14 '22 06:09

No, I use Torch-TensorRT and TorchScript. ONNX export is not needed in this case, is it?

I believe you need to export to ONNX and use TRT's ONNX parser to get the best performance. @ttyio Correct me if I'm wrong
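
The export step could look roughly like this (an untested sketch: network and settings refer to the calibrated model and config loaded as in your script, and the 416x416 input size and opset are assumptions to adjust):

import torch
from pytorch_quantization import nn as quant_nn

# Emit QuantizeLinear/DequantizeLinear (Q/DQ) nodes in the ONNX graph so that
# TensorRT's parser can derive INT8 scales from the calibration results.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

network.eval()
# Placeholder input: batch 1, (N_CHANNELS_TRAIN_BEV + 1) channels, 416x416 BEV grid.
dummy = torch.randn(1, settings.N_CHANNELS_TRAIN_BEV + 1, 416, 416, device="cuda")

torch.onnx.export(
    network, dummy, "argodet_qat.onnx",
    opset_version=13,
    input_names=["bev"],
    output_names=["kp", "hwl", "rot", "dxdy"],
)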

zerollzeng · Sep 14 '22 08:09

Okay, this is new information. Though I tend to agree with you on performance: Torch-TensorRT C++ performance seems to be not as good as other libraries like torch2trt. I haven't tried Torch-TensorRT Python since my final deployment is in C++. In any case, could you please share a mid-complexity example for this flow: PyTorch -> ONNX -> TensorRT C++? Something like U-Net for semantic segmentation.

I know there is an MNIST example, but I think it is too trivial to truly understand the details of working with TensorRT C++.

Please let me know.

SM1991CODES · Sep 14 '22 08:09

https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html#export-to-onnx
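
Building the engine from that ONNX is then a few lines with the TensorRT Python API (a rough sketch, assuming TensorRT 8.x and the argodet_qat.onnx file name from the export sketch above; with Q/DQ nodes in the graph no calibrator is needed, the INT8 flag is enough). trtexec --onnx=argodet_qat.onnx --int8 --saveEngine=argodet_qat_int8.engine does the same from the command line, and the C++ flow mirrors these calls through nvinfer1/nvonnxparser.

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
net = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(net, logger)

# Parse the Q/DQ ONNX exported from the calibrated model
with open("argodet_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # INT8 scales come from the Q/DQ nodes

engine_bytes = builder.build_serialized_network(net, config)
with open("argodet_qat_int8.engine", "wb") as f:
    f.write(engine_bytes)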

zerollzeng · Sep 14 '22 16:09

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

ttyio · Dec 06 '22 02:12