Score/confidence of predictions drops a lot after converting to TRT engine

I am using DyHead to train an object detection model: https://github.com/open-mmlab/mmdetection/blob/master/configs/dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py

Using the GPU docker image, converting to TensorRT with tools/deploy.py succeeds:

python3 tools/deploy.py /workdir/detection_onnx_static_1024x1024.py /workdir/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x.py /workdir/latest.pth /workdir/test-deploy-img-1024.jpg --device cuda --dump-info

The conversion does print a lot of warnings, though, like the ones below:

/root/workspace/mmdeploy/mmdeploy/core/optimizers/function_marker.py:158: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  ys_shape = tuple(int(s) for s in ys.shape)
/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/models/detectors/base.py:24: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  img_shape = [int(val) for val in img_shape]
/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/models/backbones.py:202: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  slice_w = (W + self.window_size - 1) // self.window_size * self.window_size
WARNING: The shape inference of mmdeploy::MMCVModulatedDeformConv2d type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
WARNING: The shape inference of mmdeploy::TRTInstanceNormalization type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
[09/01/2022-01:33:56] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +809, GPU +348, now: CPU 3460, GPU 785 (MiB)
[09/01/2022-01:33:56] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +126, GPU +60, now: CPU 3586, GPU 845 (MiB)
[09/01/2022-01:33:56] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.3.2
[09/01/2022-01:33:56] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[09/01/2022-01:35:06] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[09/01/2022-01:36:46] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[09/01/2022-01:36:48] [TRT] [I] Total Host Persistent Memory: 500480
[09/01/2022-01:36:48] [TRT] [I] Total Device Persistent Memory: 393728
[09/01/2022-01:36:48] [TRT] [I] Total Scratch Memory: 1918222336
[09/01/2022-01:36:48] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[09/01/2022-01:36:53] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 5282.29ms to assign 35 blocks to 1075 nodes requiring 2039239680 bytes.
[09/01/2022-01:36:53] [TRT] [I] Total Activation Memory: 2039239680
[09/01/2022-01:36:53] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5632, GPU 2311 (MiB)
[09/01/2022-01:36:53] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5632, GPU 2319 (MiB)
[09/01/2022-01:36:53] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.3.2
[09/01/2022-01:36:53] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[09/01/2022-01:36:53] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/01/2022-01:36:53] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
2022-09-01 01:36:54,590 - mmdeploy - INFO - Finish pipeline mmdeploy.backend.tensorrt.onnx2tensorrt.onnx2tensorrt
2022-09-01 01:36:55,388 - mmdeploy - WARNING - "visualize_model" has been skipped may be because it's             running on a headless device.
2022-09-01 01:36:55,388 - mmdeploy - INFO - All process success.

I know the result won't be exactly the same after conversion, so I am okay with some difference in the bounding box values (although they are off quite a bit too), but the score drops far too much. Below is the trt engine result (x1, y1, x2, y2, score):

[8.8506012, 358.41714, 149.80162, 495.56137, 0.081301391]

And below is the original prediction with mmdet, with the bbox values rounded down:

[0, 328, 165, 526, 0.53286]
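
For reference, a minimal sketch of how such an mmdet baseline can be produced (assuming mmdet 2.x's high-level API; the paths are the ones passed to deploy.py above):

from mmdet.apis import init_detector, inference_detector

# Paths are the same ones passed to tools/deploy.py above.
model = init_detector('/workdir/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x.py',
                      '/workdir/latest.pth', device='cuda:0')
result = inference_detector(model, '/workdir/test-deploy-img-1024.jpg')
# For a single-class model, result[0] is an ndarray of [x1, y1, x2, y2, score] rows.
print(result[0][:5])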

Here is my environment, from python3 tools/check_env.py:

2022-09-01 02:08:19,488 - mmdeploy - INFO - 

2022-09-01 02:08:19,489 - mmdeploy - INFO - **********Environmental information**********
2022-09-01 02:08:19,681 - mmdeploy - INFO - sys.platform: linux
2022-09-01 02:08:19,681 - mmdeploy - INFO - Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
2022-09-01 02:08:19,681 - mmdeploy - INFO - CUDA available: True
2022-09-01 02:08:19,681 - mmdeploy - INFO - GPU 0: NVIDIA RTX A4000
2022-09-01 02:08:19,681 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-09-01 02:08:19,681 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 11.6, V11.6.124
2022-09-01 02:08:19,681 - mmdeploy - INFO - GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - PyTorch: 1.12.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

2022-09-01 02:08:19,681 - mmdeploy - INFO - TorchVision: 0.13.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - OpenCV: 4.6.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMCV: 1.6.1
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMCV Compiler: GCC 9.3
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMCV CUDA Compiler: 11.6
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMDeploy: 0.7.0+21775ce
2022-09-01 02:08:19,681 - mmdeploy - INFO - 

2022-09-01 02:08:19,681 - mmdeploy - INFO - **********Backend information**********
2022-09-01 02:08:20,066 - mmdeploy - INFO - onnxruntime: 1.8.1	ops_is_avaliable : True
2022-09-01 02:08:20,091 - mmdeploy - INFO - tensorrt: 8.4.3.1	ops_is_avaliable : True
2022-09-01 02:08:20,105 - mmdeploy - INFO - ncnn: None	ops_is_avaliable : False
2022-09-01 02:08:20,106 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-09-01 02:08:20,107 - mmdeploy - INFO - openvino_is_avaliable: False
2022-09-01 02:08:20,121 - mmdeploy - INFO - snpe_is_available: False
2022-09-01 02:08:20,121 - mmdeploy - INFO - 

2022-09-01 02:08:20,121 - mmdeploy - INFO - **********Codebase information**********
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmdet:	2.25.1
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmseg:	None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmcls:	None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmocr:	None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmedit:	None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmdet3d:	None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmpose:	None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmrotate:	None

And here is my deploy config (/workdir/detection_onnx_static_1024x1024.py):

codebase_config = dict(
    type='mmdet',
    task='ObjectDetection',
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.05,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.5,
        max_output_boxes_per_class=200,
        pre_top_k=5000,
        keep_top_k=100,
        background_label_id=-1,
    )
)

onnx_config = dict(
    type='onnx',
    export_params=True,
    keep_initializers_as_inputs=False,
    opset_version=11,
    save_file='dyhead_swin_1024.onnx',
    input_names=['input'],
    output_names=['dets', 'labels'],
    input_shape=[1024, 1024],
    optimize=True
)

backend_config = dict(
        type='tensorrt',
        common_config=dict(fp16_mode=False, max_workspace_size=8 << 30),
        model_inputs=[
            dict(
                input_shapes=dict(
                    input=dict(min_shape=[1, 3, 1024, 1024],
                        opt_shape=[1, 3, 1024, 1024],
                        max_shape=[1, 3, 1024, 1024]
                    )
                )
            )
        ]
)
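
Before blaming the TRT build, one sanity check is to run the exported ONNX directly with onnxruntime and compare the scores; if the ORT numbers match PyTorch, the problem is on the TRT side. A minimal sketch (assuming mmdeploy's ORT custom-op library is built; get_ops_path and the session setup below are my assumptions about the usual way to load it):

import numpy as np
import onnxruntime as ort
from mmdeploy.backend.onnxruntime.init_plugins import get_ops_path

# Register mmdeploy's onnxruntime custom ops (the exported graph uses them).
so = ort.SessionOptions()
so.register_custom_ops_library(get_ops_path())
sess = ort.InferenceSession('/workdir/dyhead_swin_1024.onnx', so,
                            providers=['CPUExecutionProvider'])

# Feed the same preprocessed image that goes into the TRT engine.
img = np.zeros((1, 3, 1024, 1024), dtype=np.float32)  # replace with the real input
dets, labels = sess.run(['dets', 'labels'], {'input': img})
print(dets[0, :5])  # rows are [x1, y1, x2, y2, score]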

I also tried the default TensorRT version 8.2.x, but with no success.

Could someone please help? Thanks a lot!!

tak-ho avatar Sep 01 '22 02:09 tak-ho

Hi, what about the result under a dynamic config?

hanrui1sensetime avatar Sep 01 '22 02:09 hanrui1sensetime

Hi, what about the result under a dynamic config?

[3.2238083, 357.37311, 151.36804, 497.31665, 0.078171365]

And the dynamic config is something like this:

codebase_config = dict(
    type='mmdet', 
    task='ObjectDetection', 
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.05,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.5,
        max_output_boxes_per_class=200,
        pre_top_k=5000,
        keep_top_k=100,
        background_label_id=-1,
    )
)

onnx_config = dict(
    type='onnx',
    export_params=True,
    keep_initializers_as_inputs=False,
    opset_version=11,
    save_file='dyhead_swin_dynamic_1024.onnx',
    input_names=['input'],
    output_names=['dets', 'labels'],
    input_shape=None,
    optimize=True,
    dynamic_axes={
        'input': {
            0: 'batch',
            2: 'height',
            3: 'width'
        },
        'dets': {
            0: 'batch',
            1: 'num_dets',
        },
        'labels': {
            0: 'batch',
            1: 'num_dets',
        },
    }
)

backend_config = dict(
        type='tensorrt', 
        common_config=dict(fp16_mode=False, max_workspace_size=8 << 30),
        model_inputs=[
            dict(
                input_shapes=dict(
                    input=dict(min_shape=[1, 3, 1024, 1024],
                        opt_shape=[1, 3, 1024, 1024],
                        max_shape=[1, 3, 1024, 1024]
                    )
                )
            )
        ]
)

Here is the code that I use to run prediction with the tensorrt engine:

import pycuda.autoinit
import tensorrt as trt
from mmdeploy.backend.tensorrt.init_plugins import load_tensorrt_plugin
load_tensorrt_plugin()
import cv2
import numpy as np

# All are from the tensorrt examples
from inference_server.engine import build_engine, save_engine, load_engine, get_engine_in_out_shape
from inference_server.common import allocate_buffers, do_inference_v2

engine = load_engine('/workdir/dyhead_swin_dynamic_1024.engine')
ec = engine.create_execution_context()
inputs, outputs, bindings, stream = allocate_buffers(engine)
engine_out_shape = get_engine_in_out_shape(engine)
with open('/workdir/test-960.jpg', 'rb') as f:
    im = f.read()

batch_data = np.zeros((1, 3, 1024, 1024), dtype=np.float32)
from mmcv.image.io import imfrombytes
cv_im = imfrombytes(im)
cv_im = np.array(cv_im, dtype=np.uint8, order='C')

from mmdet.datasets.pipelines.transforms import Pad, Resize, Normalize

# same as the testing pipeline in the model config
im_res = Resize(img_scale=(960,960), keep_ratio=True, backend='pillow')({'img':cv_im})
im_res = Normalize([123.675, 116.28, 103.53], [58.395, 57.12, 57.375], to_rgb=True)(im_res)
im_res = Pad(size_divisor=128)(im_res)

nor_cv_im = im_res['img'].transpose((2, 0, 1))
batch_data[0] = nor_cv_im
inputs[0].host = np.ascontiguousarray(batch_data)
res = do_inference_v2(ec, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
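
The raw host buffers come back flat, so they still need reshaping into the exported dets/labels layout. This is how I read them back (a sketch, assuming do_inference_v2 returns the outputs in binding order and keep_top_k=100 from the deploy config):

num_dets = 100  # keep_top_k in the deploy config above
dets = res[0].reshape(1, num_dets, 5)  # rows are [x1, y1, x2, y2, score]
labels = res[1].reshape(1, num_dets)   # class index per detection
print(dets[0, 0], labels[0, 0])        # best-scoring detection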

tak-ho avatar Sep 01 '22 04:09 tak-ho

Could you please provide me with the files you have used: /workdir/detection_onnx_static_1024x1024.py, /workdir/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x.py, /workdir/latest.pth? Are they different from our official configs and checkpoints?

hanrui1sensetime avatar Sep 01 '22 06:09 hanrui1sensetime

/workdir/detection_onnx_static_1024x1024.py is the one I posted above; let me post it again:

codebase_config = dict(
    type='mmdet',
    task='ObjectDetection',
    model_type='end2end',
    post_processing=dict(
        score_threshold=0.05,
        confidence_threshold=0.005,  # for YOLOv3
        iou_threshold=0.5,
        max_output_boxes_per_class=200,
        pre_top_k=5000,
        keep_top_k=100,
        background_label_id=-1,
    )
)

onnx_config = dict(
    type='onnx',
    export_params=True,
    keep_initializers_as_inputs=False,
    opset_version=11,
    save_file='dyhead_swin_1024.onnx',
    input_names=['input'],
    output_names=['dets', 'labels'],
    input_shape=[1024, 1024],
    optimize=True
)

backend_config = dict(
        type='tensorrt',
        common_config=dict(fp16_mode=False, max_workspace_size=8 << 30),
        model_inputs=[
            dict(
                input_shapes=dict(
                    input=dict(min_shape=[1, 3, 1024, 1024],
                        opt_shape=[1, 3, 1024, 1024],
                        max_shape=[1, 3, 1024, 1024]
                    )
                )
            )
        ]
)

/workdir/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x.py:

checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22k.pth'
model = dict(
    type='ATSS',
    backbone=dict(
        type='SwinTransformer',
        pretrain_img_size=384,
        embed_dims=192,
        depths=[2, 2, 18, 2],
        num_heads=[6, 12, 24, 48],
        window_size=12,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=(1, 2, 3),
        with_cp=False,
        convert_weights=True,
        init_cfg=dict(
            type='Pretrained',
            checkpoint=
'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22k.pth'
        )),
    neck=[
        dict(
            type='FPN',
            in_channels=[384, 768, 1536],
            out_channels=256,
            start_level=0,
            add_extra_convs='on_output',
            num_outs=5),
        dict(
            type='DyHead',
            in_channels=256,
            out_channels=256,
            num_blocks=6,
            zero_init_offset=False)
    ],
    bbox_head=dict(
        type='ATSSHead',
        num_classes=1,
        in_channels=256,
        pred_kernel_size=1,
        stacked_convs=0,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128],
            center_offset=0.5),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[0.1, 0.1, 0.2, 0.2]),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0),
        loss_centerness=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)),
    train_cfg=dict(
        assigner=dict(type='ATSSAssigner', topk=9),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.6),
        max_per_img=100))

dataset_type = 'CocoDataset'
classes = ('abc', )
data_root = '/mnt/data/dataset/abc-detection/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

data = dict(
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type='RepeatDataset',
        times=2,
        dataset=dict(
            type='CocoDataset',
            classes=('abc', ),
            filter_empty_gt=False,
            ann_file=
            '/mnt/data/dataset/abc-detection/labels/coco.json',
            img_prefix=
            '/mnt/data/dataset/abc-detection/images/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    type='Resize',
                    img_scale=[(960, 960)],
                    multiscale_mode='range',
                    keep_ratio=True,
                    backend='pillow'),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='RandomAffine',
                    max_rotate_degree=10,
                    max_translate_ratio=0.1,
                    max_shear_degree=2,
                    scaling_ratio_range=(0.9, 1.1)),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=128),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ])),
    val=dict(
        type='CocoDataset',
        classes=('abc', ),
        filter_empty_gt=False,
        ann_file=
        '/mnt/data/dataset/abc-detection/labels/coco.json',
        img_prefix=
        '/mnt/data/dataset/abc-detection/images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(960, 960),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True, backend='pillow'),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=128),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        classes=('abc', ),
        filter_empty_gt=False,
        ann_file=
        '/mnt/data/dataset/abc-detection/labels/coco.json',
        img_prefix=
        '/mnt/data/dataset/abc-detection/images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(960, 960),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True, backend='pillow'),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=128),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer_config = dict(grad_clip=None)
optimizer = dict(
    type='AdamW',
    lr=5e-05,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_cfg=dict(
        custom_keys=dict(
            absolute_pos_embed=dict(decay_mult=0.0),
            relative_position_bias_table=dict(decay_mult=0.0),
            norm=dict(decay_mult=0.0))))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
work_dir = '/mnt/data/ai-exp/abc-detection'
auto_resume = False
gpu_ids = [0]

/workdir/latest.pth is too large to post here (around 2.4 GB). The main differences from the official config are that instead of the original COCO classes I train it to detect a single class, abc, and the img_scale has changed. I used the official checkpoint as the pre-trained weights and trained for single-class detection.

Please kindly help!

tak-ho avatar Sep 01 '22 08:09 tak-ho

I am using the mmdeploy master branch (21775ce584cf9589ebade133e46d008ee60acf9d); I don't know if this information helps... and I suspect the score would drop a lot too when using the officially provided checkpoint and config. Does mmdeploy not support DyHead at this stage?

Thanks

tak-ho avatar Sep 01 '22 11:09 tak-ho

The original DyHead config and the pretrained model are on this page: https://github.com/open-mmlab/mmdetection/tree/master/configs/dyhead The style is caffe; I don't know whether that will lead to any error.

Do I need to provide any more information, @hanrui1sensetime? Thanks!!

tak-ho avatar Sep 07 '22 07:09 tak-ho

Have you solved it? I converted another model and have the same problem.

Chen-cyw avatar Sep 13 '22 01:09 Chen-cyw

Have you solved it? I converted another model and have the same problem.

I'm still waiting for a reply; I haven't solved it yet...

tak-ho avatar Sep 13 '22 07:09 tak-ho

@hanrui1sensetime I have tested again using the official config and checkpoint, and they do give different results.

deploy config: https://workdrive.zohoexternal.com/external/0397370f2292821f4995ea6051201401c2538be4fb6aea2f19c796c07be399f7
model config: https://github.com/open-mmlab/mmdetection/blob/master/configs/dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py
model checkpoint/weight: https://download.openmmlab.com/mmdetection/v2.0/dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco_20220509_100315-bc5b6516.pth
testing image: https://workdrive.zohoexternal.com/external/74105a80a4d3bfb2edebffc3f2cded21b8d348a07ef3b6e103ef00d7b755637f

The results are different: the highest score among detections from the tensorrt engine is 0.22415251, while the highest score among detections from mmdet is 0.71841.

Could you please kindly have a look? Thanks

tak-ho avatar Sep 19 '22 03:09 tak-ho

@lvhan028 @hanrui1sensetime I think this comment (https://github.com/open-mmlab/mmdeploy/issues/972#issuecomment-1250529149) can already reproduce the issue. Please let me know if I need to provide any extra information for the investigation. Thanks!

tak-ho avatar Sep 27 '22 07:09 tak-ho

This model is too big to debug (2h to convert), so it took me a lot of time. The reason is that DyDCNv2 gets different shapes for x, offset and mask when using self.spatial_conv_high and self.spatial_conv_low, which leads to a wrong result in TensorRT.
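
For context, a paraphrased sketch of the relevant part of mmdet's DyHeadBlock.forward (trimmed, not the verbatim source; attention modules omitted): offset and mask are computed at the current level's resolution, but spatial_conv_low and spatial_conv_high consume the adjacent levels' features, whose spatial sizes differ by a factor of 2.

import torch.nn.functional as F

# Paraphrased sketch of one level of mmdet's DyHeadBlock.forward;
# `block` is a DyHeadBlock, `x` the list of FPN features.
def dyhead_level_forward(block, x, level):
    offset_and_mask = block.spatial_conv_offset(x[level])      # at level's resolution
    offset = offset_and_mask[:, :block.offset_dim, :, :]
    mask = offset_and_mask[:, block.offset_dim:, :, :].sigmoid()
    sum_feat = block.spatial_conv_mid(x[level], offset, mask)  # shapes agree here
    if level > 0:
        # x[level - 1] is 2x larger than offset/mask; the stride-2 conv absorbs it
        sum_feat = sum_feat + block.spatial_conv_low(x[level - 1], offset, mask)
    if level < len(x) - 1:
        # x[level + 1] is 2x smaller than offset/mask -- the mismatched case in TRT
        sum_feat = sum_feat + F.interpolate(
            block.spatial_conv_high(x[level + 1], offset, mask),
            size=x[level].shape[-2:], mode='bilinear', align_corners=True)
    return sum_feat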

hanrui1sensetime avatar Nov 11 '22 10:11 hanrui1sensetime

Thanks for the investigation! Then should I just wait for the fix, or is there some workaround I could do on the config side?

tak-ho avatar Nov 14 '22 01:11 tak-ho

Thanks for the investigation! Then should I just wait for the fix, or is there some workaround I could do on the config side?

The official config of mmdeploy still has this problem, so there is no need to change the deploy config now.

hanrui1sensetime avatar Nov 14 '22 02:11 hanrui1sensetime

The code below can minimally reproduce your bug and shows why onnxruntime works fine while trt drops the score. If you want to run the ort backend, please switch the commented-out deploy config in the example code.

from mmdet.models.necks.dyhead import DyHeadBlock
import torch
import torch.nn.functional as F

from mmdeploy.codebase import import_codebase
from mmdeploy.core.rewriters.rewriter_manager import RewriterContext
from mmdeploy.utils import Backend, Codebase
from mmdeploy.utils.test import (WrapFunction, WrapModel, backend_checker,
                                 check_backend, get_model_outputs,
                                 get_onnx_model, get_rewrite_outputs)

from mmengine import Config

class TestModule(torch.nn.Module):
    def __init__(self):
        super(TestModule, self).__init__()
        self.spatial_conv_high = DyHeadBlock(256, 256).spatial_conv_high
    def forward(self, x, offset, mask):
        return self.spatial_conv_high(x, offset, mask)

model = TestModule()
model.requires_grad_(False)
# x = torch.load('x1.pt')
x = torch.rand(1, 256, 50, 76)
# offset = torch.load('offset.pt')
offset = torch.rand(1, 256, 100, 152)
# mask = torch.load('mask.pt')
mask = torch.rand(1, 256, 100, 152)
print(f'debugging x: {x}, offset: {offset}, mask: {mask}')
model_outputs = model.forward(x, offset, mask)
print(f'model output: {model_outputs}, shape: {model_outputs.shape}')
wrapped_model = WrapModel(model, 'forward')
rewrite_inputs = {
    'x': x,
    'offset': offset,
    'mask': mask
}
''' tensorrt config, please change it manually.
deploy_cfg = Config(
    dict(
        onnx_config=dict(output_names=None, input_shape=None),
        backend_config=dict(
            type='tensorrt',
            common_config=dict(
                fp16_mode=False, max_workspace_size=1 << 30),
            model_inputs=[
            dict(
                input_shapes=dict(
                    input=dict(min_shape=[1, 3, 32, 32],
                        opt_shape=[1, 3, 1024, 1024],
                        max_shape=[1, 256, 1024, 1024]
                    )
                )
            )
        ]),
        codebase_config=dict(
            type='mmdet',
            task='ObjectDetection')))
'''
deploy_cfg = Config(
    dict(
        onnx_config=dict(output_names=None, input_shape=None),
        backend_config=dict(
            type='onnxruntime'),
        codebase_config=dict(
            type='mmdet',
            task='ObjectDetection')))
rewrite_outputs, is_backend_output = get_rewrite_outputs(
    wrapped_model=wrapped_model,
    model_inputs=rewrite_inputs,
    deploy_cfg=deploy_cfg)

print(f'rewrite_outputs: {rewrite_outputs}, rewrite_outputs[0].shape: {rewrite_outputs[0].shape}')

hanrui1sensetime avatar Nov 14 '22 09:11 hanrui1sensetime

Hi there, I'm facing the same issue. Is there any solution?

vedrusss avatar Nov 14 '22 17:11 vedrusss

@hanrui1sensetime Thanks for the onnx workaround, but I think I will need to wait for the fix for the tensorrt version.

tak-ho avatar Nov 15 '22 01:11 tak-ho

Hi there, I'm facing the same issue. Is there any solution?

The solution is to fix the implementation of this.

hanrui1sensetime avatar Nov 21 '22 11:11 hanrui1sensetime

I've compared the modulated_deform_conv implementations for the onnxruntime and tensorrt backends and found a discrepancy: in the onnxruntime backend implementation https://github.com/open-mmlab/mmdeploy/blob/master/csrc/mmdeploy/backend_ops/onnxruntime/modulated_deform_conv/modulated_deform_conv.cpp#:~:text=void%20deformable_conv2d_ref_fp32 the offset and mask steps are calculated using the output tensor shape:

deformable_im2col_2d<float>(
      src + b * src_c * src_h * src_w + g * ic_per_gp * src_h * src_w,
      offset + b * offset_group * 2 * kernel_h * kernel_w * dst_h * dst_w,
      mask + b * offset_group * kernel_h * kernel_w * dst_h * dst_w, 
      src_h, src_w, kernel_h,
      kernel_w, pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, ic_per_gp,
      offset_group, dst_h, dst_w, mask != nullptr, columns);

while the implementation for tensorrt https://github.com/open-mmlab/mmdeploy/blob/master/csrc/mmdeploy/backend_ops/tensorrt/modulated_deform_conv/trt_modulated_deform_conv_kernel.cu#:~:text=typename%20scalar_t%3E-,void%20ModulatedDeformConvForwardCUDAKernelLauncher,-( uses the input tensor shape:

const size_t offset_step = deformable_group * kernel_h * kernel_w * 2 * height * width;
const size_t mask_step = deformable_group * kernel_h * kernel_w *  height * width;
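
To make the difference concrete, a quick back-of-the-envelope check with the shapes from the minimal repro (my own illustration, not code from either backend):

# Shapes from the minimal repro: x is 50x76, offset/mask are 100x152.
deformable_group, kernel_h, kernel_w = 1, 3, 3
src_h, src_w = 50, 76     # input feature map size (what the trt kernel uses)
dst_h, dst_w = 100, 152   # offset/mask size (what the onnxruntime kernel uses)

trt_step = deformable_group * kernel_h * kernel_w * 2 * src_h * src_w
ort_step = deformable_group * kernel_h * kernel_w * 2 * dst_h * dst_w
print(trt_step, ort_step)  # 68400 vs 273600: the two kernels walk the offset
                           # buffer with different strides whenever the shapes differ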

I've tried using the output tensor shape (height_out and width_out) in the tensorrt implementation and re-built the library and the model engine, but this didn't solve the issue.

vedrusss avatar Nov 22 '22 09:11 vedrusss

@vedrusss I did the same things as you, but no luck...

tak-ho avatar Nov 23 '22 10:11 tak-ho

I found that before going inside ModulatedDeformConvForwardCUDAKernelLauncher (trt_modulated_deform_conv_kernel.cu), in mmdeploy/csrc/mmdeploy/backend_ops/tensorrt/modulated_deform_conv/trt_modulated_deform_conv.cpp the outputDesc (outputDesc[0].dims.d[2], outputDesc[0].dims.d[3]) already indicates that the output height/width is [100, 152] instead of the correct [50, 76].

Is that normal?

tak-ho avatar Nov 24 '22 06:11 tak-ho

Is there a math description anywhere of what algorithm should be implemented there? @hanrui1sensetime, @tak-ho-raspect?

vedrusss avatar Nov 24 '22 09:11 vedrusss

outputDesc[0].dims.d[2]

Looks like outputDesc[0] is not passed to ModulatedDeformConvForwardCUDAKernelLauncher as is, only one of its elements: int channels_out = outputDesc[0].dims.d[1]. BTW, where do you get the correct output height and width values from? @tak-ho-raspect

vedrusss avatar Nov 25 '22 20:11 vedrusss

@vedrusss I am using the sample code provided, where x is 50x76: x = torch.rand(1, 256, 50, 76). The output size in the onnx result is the same as the input size, so I just guess the "correct" height/width should be the same as the input's.

tak-ho avatar Nov 28 '22 01:11 tak-ho

@tak-ho-raspect, can you provide the sample code you use to run the onnx model?

vedrusss avatar Nov 28 '22 15:11 vedrusss

I've found a mention of the onnx implementation of MMCVDeformConv2d here. The author writes that the implementation of MMCVModulatedDeformConv2d differs from MMCVDeformConv2d. Maybe this is the reason for the discrepancy in detections?

vedrusss avatar Nov 28 '22 16:11 vedrusss

@vedrusss I use the code provided by @hanrui1sensetime:

from mmdet.models.necks.dyhead import DyHeadBlock
import torch
import torch.nn.functional as F

from mmdeploy.codebase import import_codebase
from mmdeploy.core.rewriters.rewriter_manager import RewriterContext
from mmdeploy.utils import Backend, Codebase
from mmdeploy.utils.test import (WrapFunction, WrapModel, backend_checker,
                                 check_backend, get_model_outputs,
                                 get_onnx_model, get_rewrite_outputs)

from mmengine import Config

class TestModule(torch.nn.Module):
    def __init__(self):
        super(TestModule, self).__init__()
        self.spatial_conv_high = DyHeadBlock(256, 256).spatial_conv_high
    def forward(self, x, offset, mask):
        return self.spatial_conv_high(x, offset, mask)

model = TestModule()
model.requires_grad_(False)
# x = torch.load('x1.pt')
x = torch.rand(1, 256, 50, 76)
# offset = torch.load('offset.pt')
offset = torch.rand(1, 256, 100, 152)
# mask = torch.load('mask.pt')
mask = torch.rand(1, 256, 100, 152)
print(f'debugging x: {x}, offset: {offset}, mask: {mask}')
model_outputs = model.forward(x, offset, mask)
print(f'model output: {model_outputs}, shape: {model_outputs.shape}')
wrapped_model = WrapModel(model, 'forward')
rewrite_inputs = {
    'x': x,
    'offset': offset,
    'mask': mask
}
''' tensorrt config, please change it manually.
deploy_cfg = Config(
    dict(
        onnx_config=dict(output_names=None, input_shape=None),
        backend_config=dict(
            type='tensorrt',
            common_config=dict(
                fp16_mode=False, max_workspace_size=1 << 30),
            model_inputs=[
            dict(
                input_shapes=dict(
                    input=dict(min_shape=[1, 3, 32, 32],
                        opt_shape=[1, 3, 1024, 1024],
                        max_shape=[1, 256, 1024, 1024]
                    )
                )
            )
        ]),
        codebase_config=dict(
            type='mmdet',
            task='ObjectDetection')))
'''
deploy_cfg = Config(
    dict(
        onnx_config=dict(output_names=None, input_shape=None),
        backend_config=dict(
            type='onnxruntime'),
        codebase_config=dict(
            type='mmdet',
            task='ObjectDetection')))
rewrite_outputs, is_backend_output = get_rewrite_outputs(
    wrapped_model=wrapped_model,
    model_inputs=rewrite_inputs,
    deploy_cfg=deploy_cfg)

print(f'rewrite_outputs: {rewrite_outputs}, rewrite_outputs[0].shape: {rewrite_outputs[0].shape}')

tak-ho avatar Nov 29 '22 01:11 tak-ho

Are there any updates?

vedrusss avatar Dec 04 '22 19:12 vedrusss

Are there any updates?

We will fix the TRTModulatedDeformConv2d kernel later.

hanrui1sensetime avatar Dec 05 '22 09:12 hanrui1sensetime

@tak-ho-raspect, @hanrui1sensetime, I've found that the discrepancy between the rewrite_outputs (in the test code above) disappears if one replaces the index here from 1 (taking the offset's shape as the output shape) to 0 (taking the input's shape as the output shape):

ret.d[2] = inputs[0].d[2];
ret.d[3] = inputs[0].d[3];

But after such a fix, the onnx2tensorrt conversion crashes with an error about wrong dimensions in a graph node:

""" [12/08/2022-01:44:30] [TRT] [I] MatMul_1915: broadcasting input1 to make tensors conform, dims(input0)=[20,144,1536][NONE] dims(input1)=[1,1536,1536][NONE]. [12/08/2022-01:44:30] [TRT] [E] [graphShapeAnalyzer.cpp::analyzeShapes::1285] Error Code 4: Miscellaneous (IElementWiseLayer Add_2075: broadcast dimensions must be conformable) Traceback (most recent call last): File "tools/onnx2tensorrt.py", line 73, in main() File "tools/onnx2tensorrt.py", line 58, in main from_onnx( File "/root/workspace/mmdeploy/mmdeploy/backend/tensorrt/utils.py", line 165, in from_onnx raise RuntimeError(f'Failed to parse onnx, {error_msgs}') RuntimeError: Failed to parse onnx, In node 2075 (parseGraph): INVALID_NODE: Invalid Node - Add_2075 [graphShapeAnalyzer.cpp::analyzeShapes::1285] Error Code 4: Miscellaneous (IElementWiseLayer Add_2075: broadcast dimensions must be conformable) """

vedrusss avatar Dec 08 '22 01:12 vedrusss

@tak-ho-raspect, @hanrui1sensetime, I've found that the discrepancy between the rewrite_outputs (in the test code above) disappears if one replaces the index here from 1 (taking the offset's shape as the output shape) to 0 (taking the input's shape as the output shape):

ret.d[2] = inputs[0].d[2];
ret.d[3] = inputs[0].d[3];

@vedrusss I remember I changed that before but it didn't work. Will the output shape and values be the same as the onnx version after changing this?

tak-ho avatar Dec 08 '22 01:12 tak-ho