mmdeploy [BUG] TensorRT-optimised model detects fewer objects than the PyTorch model, most likely due to a difference in post-processing.

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. I have read the FAQ documentation but cannot get the expected help.
[X] 3. The bug has not been fixed in the latest version.

Describe the bug

I trained a model on a non-square input size (height 1216, width 1920), then used mmdeploy's tools/deploy.py script to convert it to TensorRT with FP16 precision. When visualising the sample result, the TensorRT model detects fewer objects than the PyTorch model. I don't believe this is a problem with optimisation or quantisation, because the objects that the TensorRT model does detect have exactly the same location and confidence as in the PyTorch output. Moreover, the TensorRT model only misses objects in regions where objects are closely and densely packed, which leads me to believe there is a discrepancy in the post-processing pipeline. Please help me identify and fix the problem. All the config files are attached below for reference.

Model config file

default_scope = 'mmdet'
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=10),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='DetVisualizationHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='DetLocalVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=True)
log_level = 'INFO'
load_from = '/media/chetan/Project/Projects/rtmdet_train/mmdetection/work_dirs/config_corrected_det/epoch_160.pth'
resume = True
train_cfg = dict(
    type='EpochBasedTrainLoop',
    max_epochs=300,
    val_interval=1,
    dynamic_intervals=[(80, 1)])
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-05, by_epoch=False, begin=0,
        end=1000),
    dict(
        type='CosineAnnealingLR',
        eta_min=0.0002,
        begin=150,
        end=300,
        T_max=100,
        by_epoch=True,
        convert_to_iter_based=True)
]
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.001, weight_decay=0.05),
    paramwise_cfg=dict(
        norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
auto_scale_lr = dict(enable=False, base_batch_size=96)
dataset_type = 'CocoDataset'
data_root = '/home/chetan/Desktop/rtmdet_training/coco_finetuning_data'
backend_args = None
train_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(
        type='LoadAnnotations',
        with_bbox=True,
        with_mask=False,
        poly2mask=False),
    dict(type='CachedMosaic', img_scale=(1920, 1216), pad_val=114.0, prob=0.2),
    dict(
        type='RandomResize',
        scale=(1920, 1216),
        ratio_range=(0.8, 1.2),
        keep_ratio=True,
        prob=0.1),
    dict(
        type='RandomCrop',
        crop_size=(1920, 1216),
        recompute_bbox=True,
        allow_negative_crop=True,
        prob=0.1),
    dict(type='YOLOXHSVRandomAug', prob=0.1),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(1920, 1216), pad_val=dict(img=(114, 114, 114))),
    dict(
        type='CachedMixUp',
        img_scale=(1920, 1216),
        ratio_range=(1.0, 1.0),
        max_cached_images=20,
        prob=0.1,
        pad_val=(114, 114, 114)),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
    dict(type='PackDetInputs')
]
test_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='Resize', scale=(1920, 1216), keep_ratio=True),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor'))
]
train_dataloader = dict(
    batch_size=96,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    batch_sampler=None,
    dataset=dict(
        type='CocoDataset',
        data_root='/home/chetan/Desktop/rtmdet_training/coco_finetuning_data',
        ann_file='train/coco_annotations.json',
        data_prefix=dict(img='train/'),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(
                type='LoadAnnotations',
                with_bbox=True,
                with_mask=False,
                poly2mask=False),
            dict(
                type='CachedMosaic',
                img_scale=(1920, 1216),
                pad_val=114.0,
                prob=0.2),
            dict(
                type='RandomResize',
                scale=(1920, 1216),
                ratio_range=(0.8, 1.2),
                keep_ratio=True),
            dict(
                type='RandomCrop',
                crop_size=(1920, 1216),
                recompute_bbox=True,
                allow_negative_crop=True),
            dict(type='YOLOXHSVRandomAug'),
            dict(type='RandomFlip', prob=0.5),
            dict(
                type='Pad', size=(1920, 1216),
                pad_val=dict(img=(114, 114, 114))),
            dict(
                type='CachedMixUp',
                img_scale=(1920, 1216),
                ratio_range=(1.0, 1.0),
                max_cached_images=20,
                pad_val=(114, 114, 114),
                prob=0.2),
            dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
            dict(type='PackDetInputs')
        ],
        backend_args=None,
        metainfo=dict(
            classes=('Neoplastic', 'Inflammatory', 'Stroma',
                     'Necrosis/Dead Cells', 'Normal Epithelial'),
            palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
                     (0, 255, 255)])),
    pin_memory=True)
val_dataloader = dict(
    batch_size=32,
    num_workers=8,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CocoDataset',
        data_root='/home/chetan/Desktop/rtmdet_training/coco_finetuning_data',
        ann_file='val/coco_annotations.json',
        data_prefix=dict(img='val/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(type='Resize', scale=(1920, 1216), keep_ratio=True),
            dict(
                type='Pad', size=(1920, 1216),
                pad_val=dict(img=(114, 114, 114))),
            dict(
                type='PackDetInputs',
                meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                           'scale_factor'))
        ],
        backend_args=None,
        metainfo=dict(
            classes=('Neoplastic', 'Inflammatory', 'Stroma',
                     'Necrosis/Dead Cells', 'Normal Epithelial'),
            palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
                     (0, 255, 255)])))
test_dataloader = dict(
    batch_size=64,
    num_workers=8,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CocoDataset',
        data_root='/home/chetan/Desktop/rtmdet_training/coco_finetuning_data',
        ann_file='val/coco_annotations.json',
        data_prefix=dict(img='val/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(type='Resize', scale=(1920, 1216), keep_ratio=True),
            dict(
                type='Pad', size=(1920, 1216),
                pad_val=dict(img=(114, 114, 114))),
            dict(
                type='PackDetInputs',
                meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                           'scale_factor'))
        ],
        backend_args=None,
        metainfo=dict(
            classes=('Neoplastic', 'Inflammatory', 'Stroma',
                     'Necrosis/Dead Cells', 'Normal Epithelial'),
            palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
                     (0, 255, 255)])))
val_evaluator = dict(
    type='CocoMetric',
    ann_file=
    '/home/chetan/Desktop/rtmdet_training/coco_finetuning_data/val/coco_annotations.json',
    metric='bbox',
    format_only=False,
    backend_args=None,
    proposal_nums=(3000, 1, 10))
test_evaluator = dict(
    type='CocoMetric',
    ann_file=
    '/home/chetan/Desktop/rtmdet_training/coco_finetuning_data/val/coco_annotations.json',
    metric='bbox',
    format_only=False,
    backend_args=None,
    proposal_nums=(3000, 1, 10))
tta_model = dict(
    type='DetTTAModel',
    tta_cfg=dict(nms=dict(type='nms', iou_threshold=0.6), max_per_img=3000))
img_scales = [(1920, 1216), (256, 256)]
tta_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(
        type='TestTimeAug',
        transforms=[[{
            'type': 'Resize',
            'scale': (1920, 1216),
            'keep_ratio': True,
            'prob': 0.0
        }, {
            'type': 'Resize',
            'scale': (144, 128),
            'keep_ratio': True,
            'prob': 0.0
        }, {
            'type': 'Resize',
            'scale': (576, 512),
            'keep_ratio': True,
            'prob': 0.0
        }],
                    [{
                        'type': 'RandomFlip',
                        'prob': 0.0
                    }, {
                        'type': 'RandomFlip',
                        'prob': 0.0
                    }],
                    [{
                        'type': 'Pad',
                        'size': (1920, 1216),
                        'pad_val': {
                            'img': (114, 114, 114)
                        },
                        'prob': 0.0
                    }],
                    [{
                        'type':
                        'PackDetInputs',
                        'meta_keys':
                        ('img_id', 'img_path', 'ori_shape', 'img_shape',
                         'scale_factor', 'flip', 'flip_direction')
                    }]])
]
model = dict(
    type='RTMDet',
    data_preprocessor=dict(
        type='DetDataPreprocessor',
        mean=[179.92, 149.48, 198.26],
        std=[14.06, 11.88, 11.06],
        bgr_to_rgb=True,
        batch_augments=None),
    backbone=dict(
        type='CSPNeXt',
        arch='P5',
        expand_ratio=0.5,
        deepen_factor=0.67,
        widen_factor=0.75,
        channel_attention=True,
        norm_cfg=dict(type='SyncBN'),
        act_cfg=dict(type='SiLU', inplace=True)),
    neck=dict(
        type='CSPNeXtPAFPN',
        in_channels=[192, 384, 768],
        out_channels=192,
        num_csp_blocks=2,
        expand_ratio=0.5,
        norm_cfg=dict(type='SyncBN'),
        act_cfg=dict(type='SiLU', inplace=True)),
    bbox_head=dict(
        type='RTMDetSepBNHead',
        num_classes=5,
        in_channels=192,
        stacked_convs=2,
        feat_channels=192,
        anchor_generator=dict(
            type='MlvlPointGenerator', offset=0, strides=[8, 16, 32]),
        bbox_coder=dict(type='DistancePointBBoxCoder'),
        loss_cls=dict(
            type='QualityFocalLoss',
            use_sigmoid=True,
            beta=2.0,
            loss_weight=1.0),
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0),
        with_objectness=False,
        exp_on_reg=True,
        share_conv=True,
        pred_kernel_size=1,
        norm_cfg=dict(type='SyncBN'),
        act_cfg=dict(type='SiLU', inplace=True)),
    train_cfg=dict(
        assigner=dict(type='DynamicSoftLabelAssigner', topk=13),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=30000,
        min_bbox_size=0,
        score_thr=0.001,
        nms=dict(type='nms', iou_threshold=0.6),
        max_per_img=3000))
train_pipeline_stage2 = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(
        type='LoadAnnotations',
        with_bbox=True,
        with_mask=False,
        poly2mask=False),
    dict(
        type='RandomResize',
        scale=(1920, 1216),
        ratio_range=(0.8, 1.2),
        keep_ratio=True),
    dict(
        type='RandomCrop',
        crop_size=(1920, 1216),
        recompute_bbox=True,
        allow_negative_crop=True),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(1920, 1216), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]
max_epochs = 300
stage2_num_epochs = 20
base_lr = 0.001
interval = 10
custom_hooks = [
    dict(
        type='EMAHook',
        ema_type='ExpMomentumEMA',
        momentum=0.0002,
        update_buffers=True,
        priority=49),
    dict(
        type='PipelineSwitchHook',
        switch_epoch=280,
        switch_pipeline=[
            dict(type='LoadImageFromFile', backend_args=None),
            dict(
                type='LoadAnnotations',
                with_bbox=True,
                with_mask=False,
                poly2mask=False),
            dict(
                type='RandomResize',
                scale=(1920, 1216),
                ratio_range=(0.8, 1.2),
                keep_ratio=True),
            dict(
                type='RandomCrop',
                crop_size=(1920, 1216),
                recompute_bbox=True,
                allow_negative_crop=True),
            dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
            dict(type='YOLOXHSVRandomAug', prob=0.1),
            dict(type='RandomFlip', prob=0.5),
            dict(
                type='Pad', size=(1920, 1216),
                pad_val=dict(img=(114, 114, 114))),
            dict(type='PackDetInputs')
        ])
]
metainfo = dict(
    classes=('Neoplastic', 'Inflammatory', 'Stroma', 'Necrosis/Dead Cells',
             'Normal Epithelial'),
    palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
             (0, 255, 255)])
launcher = 'none'
work_dir = './work_dirs/config_corrected_det_finetune'

The config file base_static.py

_base_ = ['../../_base_/onnx_config.py']

onnx_config = dict(output_names=['dets', 'labels'], input_shape=None)
codebase_config = dict(
    type='mmdet',
    task='ObjectDetection',
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.05,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.6,
        max_output_boxes_per_class=3000,
        pre_top_k=5000,
        keep_top_k=3000,
        background_label_id=-1,
    ))

The tensorrt static optimisation

_base_ = ['./base_static.py', '../../_base_/backends/tensorrt.py']

onnx_config = dict(input_shape=(1920, 1216))

backend_config = dict(
    common_config=dict(max_workspace_size=1 << 30),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 1216, 1920],
                    opt_shape=[1, 3, 1216, 1920],
                    max_shape=[1, 3, 1216, 1920])))
    ])

Below are the detail.json, pipeline.json and deploy.json

deploy.json

{
    "version": "1.0.0",
    "task": "Detector",
    "models": [
        {
            "name": "rtmdet",
            "net": "end2end.engine",
            "weights": "",
            "backend": "tensorrt",
            "precision": "FP16",
            "batch_size": 1,
            "dynamic_shape": false
        }
    ],
    "customs": []
}

detail.json

{
    "version": "1.0.0",
    "codebase": {
        "task": "ObjectDetection",
        "codebase": "mmdet",
        "version": "3.0.0",
        "pth": "/root/workspace/data/finetune_checkpoint_static/epoch_240.pth",
        "config": "/root/workspace/data/finetune_checkpoint_static/config_corrected_det_finetune.py"
    },
    "codebase_config": {
        "type": "mmdet",
        "task": "ObjectDetection",
        "model_type": "end2end",
        "post_processing": {
            "score_threshold": 0.05,
            "confidence_threshold": 0.005,
            "iou_threshold": 0.6,
            "max_output_boxes_per_class": 3000,
            "pre_top_k": 5000,
            "keep_top_k": 3000,
            "background_label_id": -1
        }
    },
    "onnx_config": {
        "type": "onnx",
        "export_params": true,
        "keep_initializers_as_inputs": false,
        "opset_version": 11,
        "save_file": "end2end.onnx",
        "input_names": [
            "input"
        ],
        "output_names": [
            "dets",
            "labels"
        ],
        "input_shape": [
            1920,
            1216
        ],
        "optimize": true
    },
    "backend_config": {
        "type": "tensorrt",
        "common_config": {
            "fp16_mode": true,
            "max_workspace_size": 1073741824
        },
        "model_inputs": [
            {
                "input_shapes": {
                    "input": {
                        "min_shape": [
                            1,
                            3,
                            1216,
                            1920
                        ],
                        "opt_shape": [
                            1,
                            3,
                            1216,
                            1920
                        ],
                        "max_shape": [
                            1,
                            3,
                            1216,
                            1920
                        ]
                    }
                }
            }
        ]
    },
    "calib_config": {}
}

And finally the pipeline.json

{
    "pipeline": {
        "input": [
            "img"
        ],
        "output": [
            "post_output"
        ],
        "tasks": [
            {
                "type": "Task",
                "module": "Transform",
                "name": "Preprocess",
                "input": [
                    "img"
                ],
                "output": [
                    "prep_output"
                ],
                "transforms": [
                    {
                        "type": "LoadImageFromFile",
                        "backend_args": null
                    },
                    {
                        "type": "Resize",
                        "keep_ratio": false,
                        "size": [
                            1920,
                            1216
                        ]
                    },
                    {
                        "type": "Normalize",
                        "to_rgb": true,
                        "mean": [
                            179.92,
                            149.48,
                            198.26
                        ],
                        "std": [
                            14.06,
                            11.88,
                            11.06
                        ]
                    },
                    {
                        "type": "Pad",
                        "size_divisor": 1
                    },
                    {
                        "type": "DefaultFormatBundle"
                    },
                    {
                        "type": "Collect",
                        "meta_keys": [
                            "flip",
                            "img_shape",
                            "scale_factor",
                            "flip_direction",
                            "filename",
                            "img_path",
                            "img_id",
                            "img_norm_cfg",
                            "valid_ratio",
                            "pad_param",
                            "pad_shape",
                            "ori_filename",
                            "ori_shape"
                        ],
                        "keys": [
                            "img"
                        ]
                    }
                ]
            },
            {
                "name": "rtmdet",
                "type": "Task",
                "module": "Net",
                "is_batched": false,
                "input": [
                    "prep_output"
                ],
                "output": [
                    "infer_output"
                ],
                "input_map": {
                    "img": "input"
                },
                "output_map": {}
            },
            {
                "type": "Task",
                "module": "mmdet",
                "name": "postprocess",
                "component": "ResizeBBox",
                "params": {
                    "nms_pre": 30000,
                    "min_bbox_size": 0,
                    "score_thr": 0.001,
                    "nms": {
                        "type": "nms",
                        "iou_threshold": 0.6
                    },
                    "max_per_img": 3000
                },
                "output": [
                    "post_output"
                ],
                "input": [
                    "prep_output",
                    "infer_output"
                ]
            }
        ]
    }
}

Reproduction

python /root/workspace/mmdeploy/tools/deploy.py \
    /root/workspace/mmdeploy/configs/mmdet/detection/detection_tensorrt-fp16_static-AOI.py \
    /root/workspace/data/finetune_checkpoint_static/config_corrected_det_finetune.py \
    /root/workspace/data/finetune_checkpoint_static/epoch_240.pth \
    /root/workspace/data/1105.png \
    --test-img /root/workspace/data/1105.png \
    --work-dir /root/workspace/data/finetune_checkpoint_static \
    --device cuda \
    --log-level INFO \
    --show \
    --dump-info

I modified the config files to accommodate the 1216 x 1920 resolution. I understand all the changes required, and the PyTorch model works flawlessly. However, the TensorRT-optimised model fails to predict some objects that are densely located.

Environment

05/11 09:00:49 - mmengine - INFO - 

05/11 09:00:49 - mmengine - INFO - **********Environmental information**********
05/11 09:00:50 - mmengine - INFO - sys.platform: linux
05/11 09:00:50 - mmengine - INFO - Python: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
05/11 09:00:50 - mmengine - INFO - CUDA available: True
05/11 09:00:50 - mmengine - INFO - numpy_random_seed: 2147483648
05/11 09:00:50 - mmengine - INFO - GPU 0: NVIDIA GeForce GTX 1650
05/11 09:00:50 - mmengine - INFO - CUDA_HOME: /usr/local/cuda
05/11 09:00:50 - mmengine - INFO - NVCC: Cuda compilation tools, release 11.6, V11.6.124
05/11 09:00:50 - mmengine - INFO - GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
05/11 09:00:50 - mmengine - INFO - PyTorch: 1.11.0+cu113
05/11 09:00:50 - mmengine - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

05/11 09:00:50 - mmengine - INFO - TorchVision: 0.12.0+cu113
05/11 09:00:50 - mmengine - INFO - OpenCV: 4.7.0
05/11 09:00:50 - mmengine - INFO - MMEngine: 0.7.3
05/11 09:00:50 - mmengine - INFO - MMCV: 2.0.0
05/11 09:00:50 - mmengine - INFO - MMCV Compiler: GCC 9.3
05/11 09:00:50 - mmengine - INFO - MMCV CUDA Compiler: 11.3
05/11 09:00:50 - mmengine - INFO - MMDeploy: 1.0.0+
05/11 09:00:50 - mmengine - INFO - 

05/11 09:00:50 - mmengine - INFO - **********Backend information**********
05/11 09:00:50 - mmengine - INFO - tensorrt:	8.2.4.2
05/11 09:00:50 - mmengine - INFO - tensorrt custom ops:	Available
05/11 09:00:50 - mmengine - INFO - ONNXRuntime:	None
05/11 09:00:50 - mmengine - INFO - ONNXRuntime-gpu:	1.8.1
05/11 09:00:50 - mmengine - INFO - ONNXRuntime custom ops:	Available
05/11 09:00:50 - mmengine - INFO - pplnn:	None
05/11 09:00:50 - mmengine - INFO - ncnn:	None
05/11 09:00:50 - mmengine - INFO - snpe:	None
05/11 09:00:50 - mmengine - INFO - openvino:	None
05/11 09:00:50 - mmengine - INFO - torchscript:	1.11.0+cu113
05/11 09:00:50 - mmengine - INFO - torchscript custom ops:	NotAvailable
05/11 09:00:50 - mmengine - INFO - rknn-toolkit:	None
05/11 09:00:50 - mmengine - INFO - rknn-toolkit2:	None
05/11 09:00:50 - mmengine - INFO - ascend:	None
05/11 09:00:50 - mmengine - INFO - coreml:	None
05/11 09:00:50 - mmengine - INFO - tvm:	None
05/11 09:00:50 - mmengine - INFO - vacc:	None
05/11 09:00:50 - mmengine - INFO - 

05/11 09:00:50 - mmengine - INFO - **********Codebase information**********
05/11 09:00:50 - mmengine - INFO - mmdet:	3.0.0
05/11 09:00:50 - mmengine - INFO - mmseg:	None
05/11 09:00:50 - mmengine - INFO - mmpretrain:	None
05/11 09:00:50 - mmengine - INFO - mmocr:	None
05/11 09:00:50 - mmengine - INFO - mmedit:	None
05/11 09:00:50 - mmengine - INFO - mmdet3d:	None
05/11 09:00:50 - mmengine - INFO - mmpose:	None
05/11 09:00:50 - mmengine - INFO - mmrotate:	None
05/11 09:00:50 - mmengine - INFO - mmaction:	None
05/11 09:00:50 - mmengine - INFO - mmrazor:	None



Error traceback

No response

May 11 '23 09:05 himansh1314

test_pipeline = [
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='Resize', scale=(1920, 1216), keep_ratio=True),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor'))
]

Please try detection_tensorrt-xxx_dynamic-xx.py and edit the min/opt/max shapes to suit your needs. When you run inference with PyTorch, the resize step in preprocessing is a keep-ratio resize. With a static model config, that resize is replaced by a fixed-size resize, which does not match the PyTorch preprocessing.
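To make the difference concrete, here is a minimal sketch of the two resize strategies. It assumes the keep-ratio rule scales the image so that its long and short edges fit within the long and short edges of the target scale. The function names are hypothetical, and this is not the mmcv or mmdeploy implementation.

```python
def keep_ratio_size(src_wh, scale_wh):
    """(w, h) after an aspect-preserving resize that fits within scale_wh."""
    src_w, src_h = src_wh
    # Assumed rule: compare long edge with long edge and short edge with
    # short edge, then take the smaller factor so the result fits the scale.
    factor = min(max(scale_wh) / max(src_wh), min(scale_wh) / min(src_wh))
    return int(src_w * factor + 0.5), int(src_h * factor + 0.5)


def fixed_size(src_wh, scale_wh):
    """(w, h) after a fixed-size resize: the source shape is ignored."""
    return tuple(scale_wh)


scale = (1920, 1216)  # (w, h), as in the test_pipeline above
for src in [(1920, 1216), (1000, 1000)]:
    print(src, keep_ratio_size(src, scale), fixed_size(src, scale))
```

For an input that is already 1920 x 1216, both strategies return the same size, so under this assumption the two preprocessing paths should only diverge for other aspect ratios.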

May 11 '23 11:05 irexyc

Thanks for the response, @irexyc. However, in my case the images I give to the model always have the same static size (1216x1920), so I don't need the resize transform at all. I have also tried the dynamic config, and the results were the same.

May 11 '23 11:05 himansh1314

@himansh1314

There is a bug in static resize: mmdet treats the resize scale as (w, h), while the mmdeploy SDK treats it as (h, w). This PR will fix it: https://github.com/open-mmlab/mmdeploy/pull/2063.

You can edit pipeline.json and swap the two resize parameters to see if that helps.
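A minimal sketch of that workaround, applied to a trimmed copy of the Preprocess task from the pipeline.json above. The helper name swap_resize_size is hypothetical; in practice you would load the real file with json.load and write it back with json.dump.

```python
import json


def swap_resize_size(pipeline_cfg):
    """Reverse the "size" entry of every Resize transform, in place."""
    for task in pipeline_cfg["pipeline"]["tasks"]:
        for transform in task.get("transforms", []):
            if transform.get("type") == "Resize" and "size" in transform:
                transform["size"] = transform["size"][::-1]
    return pipeline_cfg


# Trimmed copy of the Preprocess task from pipeline.json.
cfg = json.loads("""
{
    "pipeline": {
        "tasks": [
            {
                "module": "Transform",
                "name": "Preprocess",
                "transforms": [
                    {"type": "LoadImageFromFile", "backend_args": null},
                    {"type": "Resize", "keep_ratio": false, "size": [1920, 1216]}
                ]
            }
        ]
    }
}
""")

swap_resize_size(cfg)
print(cfg["pipeline"]["tasks"][0]["transforms"][1]["size"])
```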

May 11 '23 11:05 irexyc

@irexyc Yes, I noticed the bug. I swapped the parameters and it worked fine. In fact, I also removed the 'Resize' transform completely and that worked fine as well, because my images are already 1216x1920 and all the images in my pipeline have a static size. However, the issue I described above is still there.

May 11 '23 11:05 himansh1314

@irexyc I think there is an issue with post-processing in TensorRT. I say this because the objects detected by both the torch model and the TensorRT model have exactly the same confidence scores. So it's fairly clear that preprocessing, inference and model optimisation are equivalent, but something differs in post-processing (probably NMS) and is discarding some detections.
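One way to check this claim is to match the two sets of detections by IoU and list the PyTorch boxes that have no TensorRT counterpart. The sketch below uses made-up boxes and hypothetical helper names; real boxes would come from the two inference runs.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def unmatched(reference, candidate, iou_thr=0.9):
    """Indices of reference boxes with no candidate box at IoU >= iou_thr."""
    return [
        i for i, ref in enumerate(reference)
        if not any(iou(ref, c) >= iou_thr for c in candidate)
    ]


# Made-up boxes: the second PyTorch box sits close to the first one.
pytorch_boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
tensorrt_boxes = [(0, 0, 10, 10), (100, 100, 110, 110)]

missing = unmatched(pytorch_boxes, tensorrt_boxes)
print(missing)  # [1]
```

If the missing indices cluster in dense regions while every matched box agrees in score, that points at candidate filtering or NMS rather than at the network outputs.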

May 11 '23 11:05 himansh1314

@himansh1314 We made NMS part of model inference instead of postprocessing. However, the NMS settings are fixed when you convert the model (they are not read from your model config). You can edit the parameters in configs/mmdet/_base_/base_static.py before converting the model.
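Because the deployed NMS uses the values in base_static.py rather than the model's test_cfg, it may help to put the two side by side. The sketch below copies the values quoted in this issue. The key correspondence is an assumption based on the parameter names, not an official mapping.

```python
# Values copied from the model config's test_cfg quoted in this issue
# (nms.iou_threshold is flattened to iou_threshold for the comparison).
model_test_cfg = dict(
    score_thr=0.001, nms_pre=30000, iou_threshold=0.6, max_per_img=3000)

# Values copied from post_processing in base_static.py.
deploy_post_processing = dict(
    score_threshold=0.05, iou_threshold=0.6, pre_top_k=5000, keep_top_k=3000)

# Assumed correspondence between model keys and deploy keys.
key_map = {
    'score_thr': 'score_threshold',
    'nms_pre': 'pre_top_k',
    'iou_threshold': 'iou_threshold',
    'max_per_img': 'keep_top_k',
}

mismatches = {
    model_key: (model_test_cfg[model_key], deploy_post_processing[deploy_key])
    for model_key, deploy_key in key_map.items()
    if model_test_cfg[model_key] != deploy_post_processing[deploy_key]
}
print(mismatches)
```

With the values above, only the score threshold and the pre-NMS top-k differ between the two configs.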

May 11 '23 11:05 irexyc

@irexyc I think I understand the likely issue. In configs/mmdet/_base_/base_static.py, the parameter pre_top_k is set to 5000. I'm not sure what this parameter means, but my best guess is that it selects the top 5000 predictions by confidence score, which are then passed to NMS. In my config.py file I had set nms_pre to 30000, and hence the PyTorch model was able to detect more objects. However, when I changed pre_top_k to 30000 in configs/mmdet/_base_/base_static.py, the model converted successfully, but when tools/deploy.py tested the model for visualisation, it crashed with this error:

05/11 12:42:35 - mmengine - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/mmdeploy/lib/libmmdeploy_tensorrt_ops.so
05/11 12:42:35 - mmengine - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/mmdeploy/lib/libmmdeploy_tensorrt_ops.so
#assertion/root/workspace/mmdeploy/csrc/mmdeploy/backend_ops/tensorrt/common_impl/nms/allClassNMS.cu,210
05/11 12:42:52 - mmengine - ERROR - /root/workspace/mmdeploy/tools/deploy.py - create_process - 82 - visualize tensorrt model failed.

Also, when I try to run inference using the mmdeploy_runtime SDK, the operation is aborted during inference. I don't understand how PyTorch NMS can handle 30000 predictions while TensorRT fails. I also tried changing the value to 15000, 10000 and 7500, but nothing worked. I think this is an important issue and would really appreciate your help with it. @irexyc
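For intuition only, here is a simplified, class-agnostic sketch of how a score threshold and a pre-NMS top-k cut interact with greedy NMS. It is not the mmdeploy TensorRT plugin; the function names are hypothetical, and the parameter names simply mirror the config keys discussed here.

```python
def iou(a, b):
    """IoU of two boxes; only the first four fields (x1, y1, x2, y2) are used."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def postprocess(dets, score_threshold, pre_top_k, iou_threshold=0.6):
    """dets: list of (x1, y1, x2, y2, score). Returns the boxes kept by NMS."""
    # 1. Drop low scores, 2. keep only the pre_top_k best candidates.
    cands = sorted((d for d in dets if d[4] > score_threshold),
                   key=lambda d: d[4], reverse=True)[:pre_top_k]
    kept = []
    for d in cands:  # 3. Greedy NMS: keep a box unless it overlaps a kept one.
        if all(iou(d, k) <= iou_threshold for k in kept):
            kept.append(d)
    return kept


# Five toy objects, each predicted three times with slightly shifted boxes.
base_scores = [0.9, 0.7, 0.5, 0.3, 0.04]
dets = [(20 * i + j, 0, 20 * i + 10 + j, 10, s - 0.005 * j)
        for i, s in enumerate(base_scores) for j in range(3)]

print(len(postprocess(dets, score_threshold=0.001, pre_top_k=30000)))  # 5
print(len(postprocess(dets, score_threshold=0.05, pre_top_k=5000)))    # 4
print(len(postprocess(dets, score_threshold=0.001, pre_top_k=6)))      # 2
```

In this toy setup, a higher score threshold drops the weakest object, and a small pre_top_k drops every object whose candidates rank below the cut, even though NMS itself is unchanged.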

May 11 '23 13:05 himansh1314

@himansh1314

It seems that you have already modified some of the content of base_static.py. However, you didn't change score_threshold to 0.001. I'm not sure it will help, but you can give it a try.

The config file base_static.py

_base_ = ['../../_base_/onnx_config.py']
onnx_config = dict(output_names=['dets', 'labels'], input_shape=None)
codebase_config = dict(
    type='mmdet',
    task='ObjectDetection',
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.05,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.6,
        max_output_boxes_per_class=3000,
        pre_top_k=5000,
        keep_top_k=3000,
        background_label_id=-1,
    ))

There are some asserts in allClassNMS.cu; @grimoire could have a look at this.

May 12 '23 03:05 irexyc

This is the error message I get when I convert the model using tools/deploy.py after changing a few settings in base_static.py.

 build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
05/12 05:05:21 - mmengine - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/mmdeploy/lib/libmmdeploy_tensorrt_ops.so
05/12 05:05:21 - mmengine - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/mmdeploy/lib/libmmdeploy_tensorrt_ops.so
#assertion/root/workspace/mmdeploy/csrc/mmdeploy/backend_ops/tensorrt/common_impl/nms/allClassNMS.cu,210
05/12 05:05:38 - mmengine - ERROR - /root/workspace/mmdeploy/tools/deploy.py - create_process - 82 - visualize tensorrt model failed.

The code indeed asserts something.

Here's the base_static.py file

_base_ = ['../../_base_/onnx_config.py']

onnx_config = dict(output_names=['dets', 'labels'], input_shape=None)
codebase_config = dict(
    type='mmdet',
    task='ObjectDetection',
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.001,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.6,
        max_output_boxes_per_class=3000,
        pre_top_k=10000,
        keep_top_k=3000,
        background_label_id=-1,
    ))

Note that in the PyTorch config the number of predictions kept before NMS is set to 30000, whereas pre_top_k in base_static.py is set to 5000. My model is supposed to predict over 1000 objects, which can be densely packed, so I tried changing pre_top_k to larger values such as 30000 and 10000. I just want my TensorRT model to give the same predictions as the PyTorch model and not miss any. Please look into this. @irexyc @grimoire

May 12 '23 05:05 himansh1314

Please keep pre_top_k = 5000; there are asserts in allClassNMS.cu:

  const static int BS = 512;
   ...
  const int t_size = (top_k + BS - 1) / BS;

  ASSERT(t_size <= 10);
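
The arithmetic behind this assert can be checked with a short Python sketch. It mirrors the integer ceiling division from the snippet above and assumes that top_k in the kernel corresponds to pre_top_k from the deploy config; the helper names are hypothetical.

```python
BS = 512          # block size, from the snippet above
MAX_T_SIZE = 10   # limit enforced by ASSERT(t_size <= 10)


def t_size_for(top_k, bs=BS):
    """Integer ceiling division, mirroring (top_k + BS - 1) / BS in C++."""
    return (top_k + bs - 1) // bs


def passes_assert(top_k):
    """True if this top_k would satisfy ASSERT(t_size <= 10)."""
    return t_size_for(top_k) <= MAX_T_SIZE


max_top_k = BS * MAX_T_SIZE  # largest top_k that still satisfies the assert

for top_k in (5000, 7500, 10000, 15000, 30000):
    print(top_k, t_size_for(top_k), passes_assert(top_k))
```

Under these assumptions the largest accepted value is 512 * 10 = 5120, so 7500, 10000, 15000 and 30000 all trip the assert, which would explain why none of those values worked.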

May 12 '23 05:05 irexyc

@irexyc @grimoire I understand that there are some asserts. Is there any way around this? I think this cap on pre_top_k is causing the large difference in results between the TensorRT and PyTorch models. Is there a way to run inference with TensorRT and do the NMS and post-processing in PyTorch? I don't mind if the latency goes up a little.

May 12 '23 05:05 himansh1314

@irexyc If I comment out the ASSERT in the repository and rebuild the container from scratch, would it work? Or are there other checks and dependencies in other parts of the code?

May 12 '23 05:05 himansh1314

I think NMS will work after commenting out the assert.

I'm not quite sure why there is an assert here, compared to https://github.com/NVIDIA/TensorRT/blob/master/plugin/common/kernels/allClassNMS.cu. @grimoire may be able to explain.

With score_threshold=0.001, the results are still very different from PyTorch, right?

May 12 '23 05:05 irexyc

t_size is the cache size of each CUDA thread in the NMS kernel. https://github.com/NVIDIA/TensorRT/blob/96e23978cd6e4a8fe869696d3d8ec2b47120629b/plugin/common/kernels/allClassNMS.cu#L196

A large cache size leads to low occupancy (a large number of registers is required). https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

If you insist ... add P(X) in https://github.com/open-mmlab/mmdeploy/blob/26b66ef5112ce47b8f4562eef49aae9614b8c633/csrc/mmdeploy/backend_ops/tensorrt/common_impl/nms/allClassNMS.cu#L202 and comment out the assert.

May 12 '23 06:05 grimoire

@irexyc Yes, even with a score threshold of 0.001, the results didn't change. @grimoire Hi, I appreciate you helping me, thanks. Can you please tell me what X is in the P(X) that you mentioned and, if possible, describe the exact change? Sorry, I'm not very familiar with CUDA programming. I would appreciate it if you could point me to the exact code changes, specifically if I want to set the pre_top_k parameter to, let's say, 30000.

My guess is that X in P(X) is the number of registers, or perhaps the number of threads? In that case, should I modify it like this?

#define P(tsize) allClassNMS_kernel<T_SCORE, T_BBOX, (tsize)>

  void (*kernel[30])(const int, const int, const int, const int, const float, const bool,
                     const bool, float *, T_SCORE *, int *, T_SCORE *, int *, bool) = {
      P(1), P(2), P(3), P(4), P(5), P(6), P(7), P(8), P(9), P(10),
      P(11), P(12), P(13), P(14), P(15), P(16), P(17), P(18), P(19), P(20),
      P(21), P(22), P(23), P(24), P(25), P(26), P(27), P(28), P(29), P(30)
  };
//ASSERT(t_size <= 10);

Also, just to confirm: if I change the code, I'll have to build the entire thing again, right? And from which step am I supposed to rebuild? I'm optimising and running the SDK inside the docker container that you provided, so I guess I have to make the changes after the repo is cloned.

May 12 '23 06:05 himansh1314

X is the t_size you want.

// BS is 512
const int t_size = (top_k + BS - 1) / BS;

So 30000 requires t_size = 59 (about 60), I guess?

As the dockerfile indicates https://github.com/open-mmlab/mmdeploy/blob/26b66ef5112ce47b8f4562eef49aae9614b8c633/docker/GPU/Dockerfile#L68

MMDeploy should already be somewhere in the container, so running make again in the build directory after updating the code should be enough.

May 12 '23 06:05 grimoire

Yes, correct, the t_size should be 60, my bad, I should have written 'so on..' after P(30). Anyways I will make the changes and build again, and see if it works. Thanks for helping. Will update you how it goes.

May 12 '23 06:05 himansh1314

I made the changes and started the build process again, however, it ends up with this error after make -j$(nproc)

[ 80%] Linking CUDA device code CMakeFiles/mmdeploy_tensorrt_ops.dir/cmake_device_link.o
[ 80%] Building CXX object csrc/mmdeploy/net/trt/CMakeFiles/mmdeploy_trt_net.dir/trt_net.cpp.o
[ 80%] Linking CXX shared module ../../../../lib/libmmdeploy_tensorrt_ops.so
/usr/bin/ld: CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/allClassNMS.cu.o: in function `allClassNMS(CUstream_st*, int, int, int, int, float, bool, bool, nvinfer1::DataType, nvinfer1::DataType, void*, void*, void*, void*, void*, bool)':
tmpxft_00005952_00000000-6_allClassNMS.compute_87.cudafe1.cpp:(.text+0x10): multiple definition of `allClassNMS(CUstream_st*, int, int, int, int, float, bool, bool, nvinfer1::DataType, nvinfer1::DataType, void*, void*, void*, void*, void*, bool)'; CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/.ipynb_checkpoints/allClassNMS-checkpoint.cu.o:tmpxft_00005950_00000000-6_allClassNMS-checkpoint.compute_87.cudafe1.cpp:(.text+0x10): first defined here
/usr/bin/ld: CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/allClassNMS.cu.o: in function `nmsInit()':
tmpxft_00005952_00000000-6_allClassNMS.compute_87.cudafe1.cpp:(.text+0x120): multiple definition of `nmsInit()'; CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/.ipynb_checkpoints/allClassNMS-checkpoint.cu.o:tmpxft_00005950_00000000-6_allClassNMS-checkpoint.compute_87.cudafe1.cpp:(.text+0x120): first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [csrc/mmdeploy/backend_ops/tensorrt/CMakeFiles/mmdeploy_tensorrt_ops.dir/build.make:240: lib/libmmdeploy_tensorrt_ops.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:262: csrc/mmdeploy/backend_ops/tensorrt/CMakeFiles/mmdeploy_tensorrt_ops.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 80%] Linking CXX static library ../../../../lib/libmmdeploy_trt_net.a
[ 80%] Built target mmdeploy_trt_net
make: *** [Makefile:130: all] Error 2
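
The linker output above reports each symbol twice, and the "first defined here" object is built from .ipynb_checkpoints/allClassNMS-checkpoint.cu. That suggests a Jupyter autosave copy of the edited file is being compiled next to the original, presumably because the build picks up every .cu file in that folder (an assumption, not confirmed in this thread). A minimal Python sketch for locating such directories is shown here; the function name is hypothetical and the root path is taken from the log, so adjust it to your checkout.

```python
import os


def find_checkpoint_dirs(root):
    """Return every .ipynb_checkpoints directory found under root."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        if ".ipynb_checkpoints" in dirnames:
            found.append(os.path.join(dirpath, ".ipynb_checkpoints"))
            dirnames.remove(".ipynb_checkpoints")  # do not descend into it
    return sorted(found)


if __name__ == "__main__":
    # Path assumed from the log above; a missing path simply yields no results.
    for path in find_checkpoint_dirs("/root/workspace/mmdeploy/csrc"):
        print(path)
```

If a directory is listed, removing it and then re-running cmake from a clean build folder should leave only one definition of each symbol.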

May 12 '23 07:05 himansh1314

Can you try to clean the old build folder and build again?

May 12 '23 07:05 irexyc

@irexyc @grimoire That didn't help. I started the entire docker process again, and this time got another error.

/usr/bin/ld: CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/allClassNMS.cu.o: in function `allClassNMS(CUstream_st*, int, int, int, int, float, bool, bool, nvinfer1::DataType, nvinfer1::DataType, void*, void*, void*, void*, void*, bool)':
tmpxft_00000337_00000000-6_allClassNMS.compute_87.cudafe1.cpp:(.text+0x10): multiple definition of `allClassNMS(CUstream_st*, int, int, int, int, float, bool, bool, nvinfer1::DataType, nvinfer1::DataType, void*, void*, void*, void*, void*, bool)'; CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/.ipynb_checkpoints/allClassNMS-checkpoint.cu.o:tmpxft_00000338_00000000-6_allClassNMS-checkpoint.compute_87.cudafe1.cpp:(.text+0x10): first defined here
/usr/bin/ld: CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/allClassNMS.cu.o: in function `nmsInit()':
tmpxft_00000337_00000000-6_allClassNMS.compute_87.cudafe1.cpp:(.text+0x120): multiple definition of `nmsInit()'; CMakeFiles/mmdeploy_tensorrt_ops_obj.dir/common_impl/nms/.ipynb_checkpoints/allClassNMS-checkpoint.cu.o:tmpxft_00000338_00000000-6_allClassNMS-checkpoint.compute_87.cudafe1.cpp:(.text+0x120): first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [csrc/mmdeploy/backend_ops/tensorrt/CMakeFiles/mmdeploy_tensorrt_ops.dir/build.make:240: lib/libmmdeploy_tensorrt_ops.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:226: csrc/mmdeploy/backend_ops/tensorrt/CMakeFiles/mmdeploy_tensorrt_ops.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

May 12 '23 07:05 himansh1314

@irexyc @grimoire Can you please create a temporary branch where you fix this issue? It would be very helpful not only for me but for all developers who are building their own detectors on custom datasets with different requirements.

May 12 '23 09:05 himansh1314

Can you print the output of git diff from the mmdeploy root folder? I want to see the modifications you made.

May 12 '23 09:05 irexyc

git_diff

There you go. Please have a look at it, and let me know of any changes.

May 12 '23 09:05 himansh1314

The changes are the same as mine.

I followed these steps and didn't hit any errors.

docker run -it --rm --gpus all ubuntu20.04-cuda11.3-mmdeploy1.0.0 
cd /root/workspace/mmdeploy/build
vim ../csrc/mmdeploy/backend_ops/tensorrt/common_impl/nms/allClassNMS.cu # edit the code
make -j8 && make install

May 12 '23 09:05 irexyc

@irexyc So you didn't go through the cmake process again like mentioned in the dockerfile?

RUN git clone -b main https://github.com/open-mmlab/mmdeploy &&\
    cd mmdeploy &&\
    if [ -z ${VERSION} ] ; then echo "No MMDeploy version passed in, building on main" ; else git checkout tags/v${VERSION} -b tag_v${VERSION} ; fi &&\
    git submodule update --init --recursive &&\
    mkdir -p build &&\
    cd build &&\
    cmake -DMMDEPLOY_TARGET_BACKENDS="ort;trt" .. &&\
    make -j$(nproc) &&\
    cd .. &&\
    /opt/conda/bin/mim install -e .

I can't see the cmake instruction in the steps that you shared just now.

May 12 '23 09:05 himansh1314

@himansh1314 No, I didn't go through the cmake process because I didn't hit a compiler error.

You did hit a compiler error, so I suggest you delete the build folder and re-configure the project.

cmake -DMMDEPLOY_TARGET_BACKENDS="ort;trt" .. only builds the custom ops. Since you use the SDK, you can refer to these lines to configure and build mmdeploy: https://github.com/open-mmlab/mmdeploy/blob/main/docker/GPU/Dockerfile#L89C1-L102

May 12 '23 10:05 irexyc

@irexyc @grimoire I was able to make the changes to allClassNMS.cu and compile it successfully. However, this time after converting, I got another assertion error from csrc/mmdeploy/backend_ops/tensorrt/batched_nms/trt_batched_nms.cpp at line 103.

I also got the error "too many resources requested" from allClassNMS.cu at line 703.

Is there a way I can run NMS separately in PyTorch? NMS with 30000 pre-NMS predictions seems to work fine in PyTorch. I don't mind if inference time increases a little.

May 12 '23 11:05 himansh1314

@himansh1314 @irexyc @grimoire I had the same issue in mmdeploy 0.13. You are right that something is converted wrongly, but it is in the preprocessing: keep_ratio is not converted correctly.

This is a copy from your config:

config.py

pipeline=[
    dict(type='LoadImageFromFile', backend_args=None),
    dict(type='Resize', scale=(1920, 1216), keep_ratio=True),
    dict(type='Pad', size=(1920, 1216), pad_val=dict(img=(114, 114, 114))),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor'))
],

pipeline.json

"transforms": [
    { "type": "LoadImageFromFile", "backend_args": null },
    { "type": "Resize", "keep_ratio": false, "size": [ 1920, 1216 ] },
    { "type": "Normalize", "to_rgb": true, "mean": [ 179.92, 149.48, 198.26 ], "std": [ 14.06, 11.88, 11.06 ] },
    { "type": "Pad", "size_divisor": 1 },
    .....
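
To show why the keep_ratio mismatch matters, here is a simplified Python sketch of the two resize behaviours. It is a model of the idea only, not the actual mmcv or mmdeploy code; the function name and the 1280x720 input size are hypothetical.

```python
def resized_shape(img_w, img_h, target_w, target_h, keep_ratio):
    """Return (new_w, new_h, w_scale, h_scale) for a Resize step.

    Simplified model of the two behaviours, not the library implementation.
    """
    if keep_ratio:
        # One scale for both axes, so the image is not distorted.
        scale = min(target_w / img_w, target_h / img_h)
        new_w = int(img_w * scale + 0.5)
        new_h = int(img_h * scale + 0.5)
    else:
        # Stretch each axis independently to the target size.
        new_w, new_h = target_w, target_h
    return new_w, new_h, new_w / img_w, new_h / img_h


# Hypothetical 1280x720 input, target (1920, 1216) as in the configs above.
print(resized_shape(1280, 720, 1920, 1216, keep_ratio=True))
print(resized_shape(1280, 720, 1920, 1216, keep_ratio=False))
```

With keep_ratio=True the image becomes 1920x1080 and is then padded to 1920x1216. With keep_ratio=False it is stretched vertically to 1216, so the deployed model would see a distorted image compared with the one used in training.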

Jun 06 '23 15:06 shimen

@RunningLeon @irexyc @grimoire Can you confirm whether there is a bug in 0.13? Please see my previous comment.

Jun 18 '23 11:06 shimen

@shimen Hi, this model config (it has PackDetInputs) is from mmdet 3.0, which should be used with mmdeploy>=1.0.0.

Jun 26 '23 03:06 RunningLeon
