
Large performance gap between PyTorch model and ONNX model

Deephome opened this issue 1 year ago • 15 comments

Thanks for your excellent work!

I just exported the PyTorch model to an ONNX model without NMS and used deploy/onnx_demo.py to detect facial features in images. However, I found that the results of the ONNX model differ substantially from the original PyTorch model. I used the latest code, and the texts I used are "person head, face, eye, nose, mouth".

The ONNX conversion command: PYTHONPATH=./ python deploy/export_onnx.py ./configs/pretrain/yolo_world_v2_s_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py ../YOLO-World_bkp/checkpoints/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth --custom-text ../YOLO-World_bkp/custom.json --opset 12 --without-nms
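For reference, the file passed to --custom-text is a JSON list with one entry per class, each entry being a list of text prompts for that class (format assumed from the text files bundled with the repo, e.g. data/texts/*.json). A minimal way to generate it:

import json

# One inner list of prompts per class (format assumed; see data/texts/*.json)
texts = [["person head"], ["face"], ["eye"], ["nose"], ["mouth"]]
with open("custom.json", "w") as f:
    json.dump(texts, f)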

ONNX inference command:
python deploy/onnx_demo.py ./work_dirs/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.onnx ../YOLO-World_bkp/demo1.jpg "person head, face, eye, nose, mouth" --onnx-nms

The result of the PyTorch model: [image: eyst_pytorch]

and the result of the ONNX model: [image: eyst_onnx]

Deephome · May 20 '24 08:05

I encountered the same problem as Deephome: the inference results of the PyTorch model and the ONNX model are inconsistent.

syn0419 · May 21 '24 03:05

Hi @Deephome and @syn0419, have you ever tried to export models without --without-nms?

wondervictor · May 21 '24 12:05

> Hi @Deephome and @syn0419, have you ever tried to export models without --without-nms?

Yes, I encountered this issue before. Even after updating the project and following the suggestion to use "--without-nms", the problem of inconsistent results persisted. Upon investigation, the inputs to the model were the same, but the results already differed before the box decoder, so I suspected that some operations in the exported ONNX model had deviated.

syn0419 · May 22 '24 03:05

> Yes, I encountered this issue before. [...] it was already different before the box decoder. [...]

@wondervictor @syn0419 I found that even the output of backbone.image_model.stem differs between the ONNX model and the PyTorch model, although the inputs and conv weights are the same. So I suspect that the convolution operation in ONNX differs from PyTorch's. Which versions of onnx & onnxruntime should be used?

Deephome · May 22 '24 03:05

> @wondervictor @syn0419 I found that even the output of backbone.image_model.stem differs between the ONNX model and the PyTorch model [...]

@Deephome @wondervictor I have the same suspicion, but I am not certain whether the conv weights of the ONNX model's backbone are consistent with those of the PyTorch model; the ONNX model contains non-intuitive parameter names. How did you compare them one by one?

syn0419 · May 22 '24 06:05

@syn0419 I only checked the weight of backbone.image_model.stem.conv, which is the first convolution of the network. The mean error of the first conv's feature map is about 2e-08. I'm not sure whether this error is within the normal range for ONNX deployment, or whether YOLO-World is too sensitive to such a computation offset.

Deephome · May 23 '24 08:05
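For anyone who wants to reproduce this spot check, a minimal sketch is below. The ONNX file name and the initializer name are assumptions: exporters derive initializer names from the PyTorch module path, but the exact strings vary (and conv+BN fusion during export can rename or fold them), so list graph.initializer first.

import numpy as np
import onnx
import torch
from onnx import numpy_helper

# MMEngine-style checkpoints nest the weights under 'state_dict'
ckpt = torch.load("yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth",
                  map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
pt_w = state_dict["backbone.image_model.stem.conv.weight"].numpy()

# Collect every initializer (weight tensor) stored in the ONNX graph
onnx_model = onnx.load("yolo_world_v2_s.onnx")  # hypothetical file name
onnx_weights = {init.name: numpy_helper.to_array(init)
                for init in onnx_model.graph.initializer}
# print(sorted(onnx_weights))  # inspect available names first; they vary by exporter
onnx_w = onnx_weights["backbone.image_model.stem.conv.weight"]  # assumed name

print("max abs diff:", np.abs(pt_w - onnx_w).max())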

Hi all @Deephome and @syn0419, I've updated the inference demo:

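# Imports assumed by this snippet (not shown in the original comment);
# torchvision's batched_nms matches the call signature used below.
import os.path as osp
import cv2
import numpy as np
import onnxruntime as ort
import torch
from torchvision.ops import batched_nms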
def inference_with_postprocessing(ort_session,
                                  image_path,
                                  texts,
                                  output_dir,
                                  size=(640, 640),
                                  nms_thr=0.65,
                                  score_thr=0.05,
                                  max_dets=300):
    # export with `--without-nms`
    ori_image = cv2.imread(image_path)
    h, w = ori_image.shape[:2]
    image, scale_factor, pad_param = preprocess(ori_image[:, :, [2, 1, 0]],
                                                size)  # BGR -> RGB
    input_ort = ort.OrtValue.ortvalue_from_numpy(image.transpose((0, 3, 1, 2)))  # NHWC -> NCHW
    results = ort_session.run(["scores", "boxes"], {"images": input_ort})
    scores, bboxes = results
    # move numpy array to torch
    ori_scores = torch.from_numpy(scores[0]).to('cuda:0')
    ori_bboxes = torch.from_numpy(bboxes[0]).to('cuda:0')

    scores, labels = torch.max(ori_scores, dim=1)
    keep_idx = (scores > 0.001)
    bboxes = ori_bboxes[keep_idx]
    scores = scores[keep_idx]
    labels = labels[keep_idx]

    # batched nms
    bbox_inds = batched_nms(bboxes, scores, labels, iou_threshold=nms_thr)

    scores = scores[bbox_inds]
    bboxes = bboxes[bbox_inds]
    labels = labels[bbox_inds]

    scores_list = []
    labels_list = []
    bboxes_list = []

    if bbox_inds.shape[0] > max_dets:
        for cls_id in range(len(texts)):
            scores_cls = scores[labels == cls_id]
            print(scores_cls.shape)
            if scores_cls.shape[0] == 0:
                continue
            # sort descending so the highest-scoring boxes are kept
            _, index = scores_cls.sort(descending=True)
            keep_inds = index[:max_dets]
            box_cls = bboxes[labels == cls_id][keep_inds]
            scores_cls = scores[labels == cls_id][keep_inds]
            labels_cls = labels[labels == cls_id][keep_inds]

            scores_list.append(scores_cls)
            labels_list.append(labels_cls)
            bboxes_list.append(box_cls)
        scores = torch.cat(scores_list, dim=0)
        labels = torch.cat(labels_list, dim=0)
        bboxes = torch.cat(bboxes_list, dim=0)


    keep_idxs = scores > score_thr
    scores = scores[keep_idxs]
    labels = labels[keep_idxs]
    bboxes = bboxes[keep_idxs]

    # Get candidate predict info by num_dets
    scores = scores.cpu().numpy()
    bboxes = bboxes.cpu().numpy()
    labels = labels.cpu().numpy()

    bboxes -= np.array(
        [pad_param[1], pad_param[0], pad_param[1], pad_param[0]])
    bboxes /= scale_factor
    bboxes[:, 0::2] = np.clip(bboxes[:, 0::2], 0, w)
    bboxes[:, 1::2] = np.clip(bboxes[:, 1::2], 0, h)
    bboxes = bboxes.round().astype('int')

    image_out = visualize(ori_image, bboxes, labels, scores, texts)
    cv2.imwrite(osp.join(output_dir, osp.basename(image_path)), image_out)
    return image_out


wondervictor · May 25 '24 16:05

> @syn0419 I only checked the weight of backbone.image_model.stem.conv [...] The mean error of the first conv's feature map is about 2e-08. [...]

it's normal.

wondervictor · May 25 '24 16:05

@wondervictor Thank you for updating the inference_with_postprocessing() function, but it doesn't solve the problem. I compared the image feature maps at three scales (i.e., the outputs of stages 3/4/5) between the PyTorch model and the ONNX model; the max pixel-wise errors reach about 0.001, 0.014, and 0.016, respectively. That seems abnormal for ONNX deployment.

Deephome · May 27 '24 07:05
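A minimal sketch of that per-stage comparison, assuming feats_pt and feats_onnx are matching lists of stage-3/4/5 feature maps already dumped as NumPy arrays (both names hypothetical):

import numpy as np

# feats_pt / feats_onnx: hypothetical lists of matching stage-3/4/5 outputs
for stage, (a, b) in enumerate(zip(feats_pt, feats_onnx), start=3):
    diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
    print(f"stage {stage}: max abs err = {diff.max():.6f}, mean = {diff.mean():.3e}")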

Hi @Deephome, thanks for the reminder; I will further optimize the process to find out. Indeed, some operators are replaced for ONNX export, and I'll check right now whether that is necessary.

wondervictor · May 27 '24 08:05

> Hi all @Deephome and @syn0419, I've updated the inference demo: [code quoted above]

Hi, in the post-processing here, wouldn't it be better to swap these two lines for ONNX inference: bboxes -= np.array([pad_param[1], pad_param[0], pad_param[1], pad_param[0]]) and bboxes /= scale_factor? The pad obtained in pad_param above is computed on the original image, so dividing by the scale factor afterwards introduces an error, which offsets the detection boxes.

Gakkifan1314 · Jul 05 '24 07:07
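For clarity, the corrected order suggested above, as a drop-in replacement for the two lines in the demo (a sketch using the demo's variable names): the boxes come out in 640x640 network-input coordinates, while pad_param = (pad_h, pad_w) was measured on the original image before resizing, so the resize must be undone first.

    # undo the resize first (boxes are in network-input coordinates) ...
    bboxes /= scale_factor
    # ... then remove the padding, which was applied in original-image pixels
    bboxes -= np.array(
        [pad_param[1], pad_param[0], pad_param[1], pad_param[0]])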

@Gakkifan1314 Thank you for your suggestion. I had already found and fixed this bug before I submitted this issue; the performance gap caused by ONNX inference is actually not related to it. Have you evaluated the performance of the ONNX model in detail?

Deephome · Jul 06 '24 08:07

Hi @wondervictor, may I ask if there is any update on the batched_nms() used in inference_with_postprocessing() above? In trt_nms.py, the scores seem not to be in the expected format.

kezhang-cs · Jul 10 '24 02:07
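Until that's resolved, note that batched_nms in the demo only needs to be a class-aware NMS over (boxes, scores, labels). torchvision.ops.batched_nms has exactly this signature; a minimal equivalent using the standard coordinate-offset trick looks like this (a sketch, not the repo's trt_nms.py):

import torch
from torchvision.ops import nms

def batched_nms(boxes, scores, labels, iou_threshold=0.65):
    # Offset each class's boxes into a disjoint coordinate range so that
    # plain NMS never suppresses boxes across different classes.
    if boxes.numel() == 0:
        return torch.empty((0,), dtype=torch.int64, device=boxes.device)
    offsets = labels.to(boxes) * (boxes.max() + 1)
    return nms(boxes + offsets[:, None], scores, iou_threshold)

torchvision.ops.batched_nms(boxes, scores, labels, iou_threshold) can also be called directly in place of this helper.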

Any updates on this issue? I'm facing the same problem.

ForestWang · Jul 26 '24 06:07

@ForestWang @Deephome and anyone still interested in this: I found that for export_onnx.py you have to add the --add-padding flag so that the text input, and thus the results, align with those of the PyTorch model. That's the major issue here.

Second, a more minor issue: the image preprocessing in onnx_demo.py looks odd, padding first and then resizing, which can interpolate pad values into image pixels and cause a small gap between the original and ONNX image inputs (a resize-first variant is sketched after the snippet below).

def preprocess(image, size=(640, 640)):
    h, w = image.shape[:2]
    max_size = max(h, w)
    scale_factor = size[0] / max_size
    pad_h = (max_size - h) // 2
    pad_w = (max_size - w) // 2
    pad_image = np.zeros((max_size, max_size, 3), dtype=image.dtype)
    pad_image[pad_h:h + pad_h, pad_w:w + pad_w] = image
    image = cv2.resize(pad_image, size,
                       interpolation=cv2.INTER_LINEAR).astype('float32')
    image /= 255.0
    image = image[None]
    return image, scale_factor, (pad_h, pad_w)

LongIslandWithoutIceTea · Aug 27 '24 20:08
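A resize-first variant along those lines might look like the sketch below (pad value 114 is an assumption borrowed from common YOLO letterboxing, not taken from this repo). Note that the pad offsets are then in network-input pixels, so the inverse transform becomes subtract-pad-first, then divide by the scale factor.

import cv2
import numpy as np

def preprocess_letterbox(image, size=(640, 640), pad_value=114):
    # Resize first, then pad, so pad pixels are never blended into the image.
    h, w = image.shape[:2]
    scale_factor = min(size[0] / h, size[1] / w)
    new_h, new_w = round(h * scale_factor), round(w * scale_factor)
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    pad_h = (size[0] - new_h) // 2
    pad_w = (size[1] - new_w) // 2
    padded = np.full((size[0], size[1], 3), pad_value, dtype=resized.dtype)
    padded[pad_h:pad_h + new_h, pad_w:pad_w + new_w] = resized
    out = padded.astype('float32') / 255.0
    # pad offsets are in network-input pixels here, so invert with:
    # bboxes -= (pad_w, pad_h, pad_w, pad_h); bboxes /= scale_factor
    return out[None], scale_factor, (pad_h, pad_w)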