ONNX export to support TensorRT NMS plugin registration like yolov7
🚀 Feature Request
If I export YOLO-NAS as an ONNX file and inspect it with netron, I see the inputs/outputs look like this:
This means that, to use it with TensorRT, I would have to write my own NMS algorithm, probably running on the CPU, to get the final bounding boxes. However, TensorRT has an efficient GPU-accelerated NMS plugin https://github.com/NVIDIA/TensorRT/tree/main/plugin/efficientNMSPlugin that can be registered in an ONNX file.
Proposed Solution
Yolov7 does this, which makes overall end-to-end execution very fast, possibly faster than YOLO-NAS + CPU NMS at a given resolution / mAP: https://github.com/WongKinYiu/yolov7/blob/main/export.py
This is the code that does it: https://github.com/WongKinYiu/yolov7/blob/3b41c2cc709628a8c1966931e696b14c11d6db0c/utils/add_nms.py#L72
I would potentially be up for contributing this change if it would be welcome, with a little support?
I spent 2 days figuring out why the ONNX output formats are not similar when I switched from yolov7 to YOLO-NAS. Any "postprocessing" inside the ONNX that gives out reasonable bounding boxes with class and confidence would be really great.
What do you think @BloodAxe? Honestly, without the GPU-accelerated NMS, I doubt that YOLO-NAS is as good as yolov7 on the latency/accuracy curve.
+1 for this, the postprocessing is absolutely necessary for good throughput. It should be fairly easy to port this from the existing yolo variants.
Relevant code for this is here:
https://github.com/WongKinYiu/yolov7/blob/84932d70fb9e2932d0a70e4a1f02a1d6dd1dd6ca/models/experimental.py#L111
The classes ORT_NMS, TRT_NMS, ONNX_ORT, ONNX_TRT and End2End should be compatible with YOLO-NAS, I believe.
They also enable native ONNX NMS, if I'm not mistaken. So not only can the engine be exported to TensorRT, the native ONNX backends work with NMS as well.
You can attach the additional nn.Module to the ONNX exporter and it will be processed by ConvertableCompletePipelineModel.
Example code:
import os

from super_gradients.training import models
import torch
import torch.nn as nn


class PatchDeepStreamOutput(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        boxes, confscores = x
        # best class score and class index per box
        scores, classes = torch.max(confscores, 2, keepdim=True)
        return torch.cat((boxes, scores, classes), dim=2)


# `net` is the YOLO-NAS model and `trainer` the Trainer from the surrounding training code
deepstream_output_patch = PatchDeepStreamOutput()
deepstream_output_patch.eval()
models.convert_to_onnx(model=net, input_shape=(3, 640, 640), post_process=deepstream_output_patch,
                       out_path=os.path.join(trainer.checkpoints_dir_path, "best_ds.onnx"))
You can use it with the DeepStream plugins from marcoslucianops/DeepStream-Yolo.
If you want to add EfficientNMS_TRT, just add the End2End module to the postprocessing. Note that the YOLO-NAS bbox output format is BoxCorner [x1, y1, x2, y2], different from the other yolo variants' BoxCenter [cx, cy, w, h], so you need to adjust the box_coding parameter to 0.
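For illustration only, a minimal sketch of that adjustment, reusing the ONNX_TRT module from yolov7's models/experimental.py linked above (the module and attribute names come from that file, not from super-gradients):
# from models.experimental import ONNX_TRT  # copied from the yolov7 repo
# box_coding 1 = BoxCenter (cx, cy, w, h), 0 = BoxCorner (x1, y1, x2, y2)
nms_postprocess = ONNX_TRT(max_obj=100, iou_thres=0.45, score_thres=0.25, n_classes=80)
nms_postprocess.box_coding = 0  # YOLO-NAS already emits corner-format boxes
nms_postprocess.eval()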
@haritsahm
Thanks for your advice!
I've been trying to implement what you suggest but it's not quite working for me. Where am I going wrong?
I've tried quite a few variations, but this is what I'm working with at the moment:
#!/usr/bin/env python
from super_gradients.training import models
from super_gradients.common.object_names import Models
import torch
import torch.nn as nn


class TRT_NMS(torch.autograd.Function):
    '''TensorRT NMS operation'''
    @staticmethod
    def forward(
        ctx,
        boxes,
        scores,
        background_class=-1,
        box_coding=0,
        iou_threshold=0.45,
        max_output_boxes=100,
        plugin_version="1",
        score_activation=0,
        score_threshold=0.25,
    ):
        batch_size, num_boxes, num_classes = scores.shape
        num_det = torch.randint(0, max_output_boxes, (batch_size, 1), dtype=torch.int32)
        det_boxes = torch.randn(batch_size, max_output_boxes, 4)
        det_scores = torch.randn(batch_size, max_output_boxes)
        det_classes = torch.randint(0, num_classes, (batch_size, max_output_boxes), dtype=torch.int32)
        return num_det, det_boxes, det_scores, det_classes

    @staticmethod
    def symbolic(g,
                 boxes,
                 scores,
                 background_class=-1,
                 box_coding=0,
                 iou_threshold=0.45,
                 max_output_boxes=100,
                 plugin_version="1",
                 score_activation=0,
                 score_threshold=0.25):
        out = g.op("TRT::EfficientNMS_TRT",
                   boxes,
                   scores,
                   background_class_i=background_class,
                   box_coding_i=box_coding,
                   iou_threshold_f=iou_threshold,
                   max_output_boxes_i=max_output_boxes,
                   plugin_version_s=plugin_version,
                   score_activation_i=score_activation,
                   score_threshold_f=score_threshold,
                   outputs=4)
        nums, boxes, scores, classes = out
        return nums, boxes, scores, classes


class ONNX_TRT(nn.Module):
    '''onnx module with TensorRT NMS operation.'''
    def __init__(self, max_obj=100, iou_thres=0.45, score_thres=0.25, max_wh=None, device=None, n_classes=80):
        super().__init__()
        assert max_wh is None
        self.device = device if device else torch.device('cpu')
        self.background_class = -1,
        self.box_coding = 1,
        self.iou_threshold = iou_thres
        self.max_obj = max_obj
        self.plugin_version = '1'
        self.score_activation = 0
        self.score_threshold = score_thres
        self.n_classes = n_classes

    def forward(self, x):
        boxes, confscores = x
        scores, classes = torch.max(confscores, 2, keepdim=True)
        print("boxes.shape ", boxes.shape)
        print("confscores.shape ", confscores.shape)
        num_det, det_boxes, det_scores, det_classes = TRT_NMS.apply(boxes, scores, self.background_class, self.box_coding,
                                                                    self.iou_threshold, self.max_obj,
                                                                    self.plugin_version, self.score_activation,
                                                                    self.score_threshold)
        return num_det, det_boxes, det_scores, det_classes


net = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")
net.eval()
end2end = ONNX_TRT()
end2end.eval()
models.convert_to_onnx(model=net, input_shape=(3, 640, 640), post_process=end2end, out_path="yolo_nas_s.onnx")
but when I run, I get:
boxes.shape torch.Size([1, 8400, 4])
confscores.shape torch.Size([1, 8400, 80])
./export.py:42: FutureWarning: 'torch.onnx._patch_torch._graph_op' is deprecated in version 1.13 and will be removed in version 1.14. Please note 'g.op()' is to be removed from torch.Graph. Please open a GitHub issue if you need this functionality..
out = g.op("TRT::EfficientNMS_TRT",
/home/luke/.pyenv/versions/yoNAS/lib/python3.8/site-packages/torch/onnx/_patch_torch.py:81: UserWarning: The shape inference of TRT::EfficientNMS_TRT type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
_C._jit_pass_onnx_node_shape_type_inference(
/home/luke/.pyenv/versions/yoNAS/lib/python3.8/site-packages/torch/onnx/utils.py:687: UserWarning: The shape inference of TRT::EfficientNMS_TRT type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
_C._jit_pass_onnx_graph_shape_type_inference(
/home/luke/.pyenv/versions/yoNAS/lib/python3.8/site-packages/torch/onnx/utils.py:1178: UserWarning: The shape inference of TRT::EfficientNMS_TRT type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
_C._jit_pass_onnx_graph_shape_type_inference(
and a model with output like:
@LukeAI, does the error appear when exporting the model, or after export / when running inference with it?
when exporting - just when running that script^
I think it'll be fine, it's just a warning
ok, I can try. You can see in the netron screenshot that the output dimensions of the ONNX aren't defined, whereas when I export yolov7 they are. I'll also have to disable output dimension checks at inference time.
The reason they are defined for yolov7 is this manual step during export:
Basically, ONNX cannot know the output dimensions of a non-native plugin, so we have to specify the dimensions manually instead. The same way, you can specify the names of the outputs.
ah ok, that makes sense. How could we add that here?
- Change the method "convert_to_onnx" to make that adjustment before serializing to file or
- be lazy and just load the model from the onnx file again like yolov7 does it:
So something like this should do the trick (code really could be cleaner, but hey...):
# should be good to go if you have those options (opt.batch_size / opt.topk_all, as in yolov7's export.py)
import onnx

# output dims in graph order: num_dets (batch, 1), det_boxes (batch, topk, 4),
# det_scores (batch, topk), det_classes (batch, topk)
shapes = [opt.batch_size, 1, opt.batch_size, opt.topk_all, 4,
          opt.batch_size, opt.topk_all, opt.batch_size, opt.topk_all]
onnx_model = onnx.load("yolo_nas_s.onnx")  # load onnx model
onnx.checker.check_model(onnx_model)  # check onnx model
for i in onnx_model.graph.output:
    for j in i.type.tensor_type.shape.dim:
        j.dim_param = str(shapes.pop(0))
onnx.save(onnx_model, "yolo_nas_s_outdims.onnx")
Nice information. I previously used yolov6, which also has the same End2End approach to add EfficientNMS_TRT. But I've never done that, because despite the warnings the model works perfectly.
The model dimensions in the ONNX have no function. TensorRT is able to understand the output dimensions of its own plugin, so the TensorRT engine will be fine either way. But I would consider it good "documentation" so people understand the output better (e.g. when they open it in netron).
@philipp-schmidt do you have any idea how to set the names in a similar way?
I tried this:
batch_size = 1
topk_all = 100
shapes = [batch_size, 1,
          batch_size, topk_all, 4,
          batch_size, topk_all,
          batch_size, topk_all]
names = ["num_dets", "det_boxes", "det_scores", "det_classes"]
onnx_model = onnx.load(model_path)  # load onnx model
onnx.checker.check_model(onnx_model)  # check onnx model
for i in onnx_model.graph.output:
    i.name = names.pop(0)
    for j in i.type.tensor_type.shape.dim:
        j.dim_param = str(shapes.pop(0))
and it works insofar as when I inspect the onnx with netron I can see the output names are labelled as I expect:
but when I try to test with
trtexec --int8 --fp16 --avgRuns=10 --onnx=yolo_nas_s.onnx
I get a segmentation fault.
If I don't set the names and just leave the default, not particularly helpful names 915, 916, 917, 918 as below, trtexec successfully runs to completion as expected.
Please post the output of netron with your code applied.
have added to above post^
Inspecting the i object with dir() shows it has the following attributes:
['ByteSize', 'Clear', 'ClearExtension', 'ClearField', 'CopyFrom', 'DESCRIPTOR', 'DiscardUnknownFields', 'Extensions', 'FindInitializationErrors', 'FromString', 'HasExtension', 'HasField', 'IsInitialized', 'ListFields', 'MergeFrom', 'MergeFromString', 'ParseFromString', 'RegisterExtension', 'SerializePartialToString', 'SerializeToString', 'SetInParent', 'UnknownFields', 'WhichOneof', '_CheckCalledFromGeneratedFile', '_SetListener', '__class__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '_extensions_by_name', '_extensions_by_number', 'doc_string', 'name', 'type']
I just tried setting doc_string and that doesn't segfault and is an improvement over nothing:
but it would be nice if I could set the actual name as well, just to make my decoding/inference code a bit easier to read and so that it could seamlessly interoperate with different models.
In yolov7 they do this by directly passing input/output names to the torch.onnx.export function: https://github.com/WongKinYiu/yolov7/blob/84932d70fb9e2932d0a70e4a1f02a1d6dd1dd6ca/export.py#L159
It looks like super-gradients' convert_to_onnx DOES expose those arguments to us...
https://github.com/Deci-AI/super-gradients/blob/a43cfcd70072c7be1231f9183b1a717c136ff657/src/super_gradients/training/models/conversion.py#L169
ok I can confirm doing it with kwargs works:
model_path = "yolo_nas_s.onnx"
onnx_export_kwargs = {
    'input_names': ['images'],
    'output_names': ["num_dets", "det_boxes", "det_scores", "det_classes"]
}
models.convert_to_onnx(model=net, input_shape=(3, 640, 640), post_process=end2end, out_path=model_path,
                       torch_onnx_export_kwargs=onnx_export_kwargs)
Cool! No segfaults with that one?
can confirm no segfaults! no idea why it's different.
I think it's because you're force-renaming the nodes without telling the graph that the nodes have changed (I don't know how to explain it well, but I hope you get the idea). From your previous code sample:
for i in onnx_model.graph.output:
    i.name = names.pop(0)
    for j in i.type.tensor_type.shape.dim:
        j.dim_param = str(shapes.pop(0))
On the other hand, when we rename the outputs using the exporter, the graph is already configured to have nodes with those names via torch_onnx_export_kwargs, as shown in the torch onnx example.
If you want to rename a node in the onnx graph, you should create a new Variable/Node, attach it to the original onnx model outputs (the numbers 915, 916), and update the graph using onnx-graphsurgeon, similar to this example.
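Not from the thread, but a rough sketch of that onnx-graphsurgeon route; it assumes that renaming the output Variable objects (rather than only the graph.output entries in the protobuf, which is what segfaulted above) keeps the node outputs and graph outputs consistent:
import onnx
import onnx_graphsurgeon as gs  # https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon

graph = gs.import_onnx(onnx.load("yolo_nas_s.onnx"))
new_names = ["num_dets", "det_boxes", "det_scores", "det_classes"]
for tensor, new_name in zip(graph.outputs, new_names):
    # renaming the Variable object updates every reference to it, including the
    # producing EfficientNMS_TRT node, so the graph stays self-consistent
    tensor.name = new_name
onnx.save(gs.export_onnx(graph), "yolo_nas_s_renamed.onnx")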
I've written inference code, and after some experimenting everything is working (i.e. I get accurate-looking bounding boxes), except that the class values are always reported as 0 when they should be in the range 0-79. I think it might have something to do with the class_agnostic parameter, which we do not pass in the code above^ and which, per the docs, should be "set to true to do class-independent NMS; otherwise, boxes of different classes would be considered separately during NMS":
https://github.com/NVIDIA/TensorRT/tree/main/plugin/efficientNMSPlugin#parameters
But I can't work out how to pass that parameter correctly. If I pass class_agnostic_i=0 to g.op there is no behaviour change. If I pass any of class_agnostic_i=1, class_agnostic_i=-1, class_agnostic_f=1 or class_agnostic_f=-1, the ONNX exports but does not load with trt.OnnxParser without error.
I might be wrong on that front, but any help here would be greatly appreciated! Once it's working I can publish the solution publicly for others.
@haritsahm @philipp-schmidt
Export Code: https://gist.github.com/LukeAI/bbfc3ab749601ab0f2cb06e4b8fc75cb
Inference Code: https://gist.github.com/LukeAI/336a1fd9ea802d454d883342517a681f
When run with an image of 4 forks, I see the four forks correctly boxed, and this prints:
Total Running time = 0.0066 seconds
[0.888565 0.8684023 0.86700696 0.8414705 ] # det scores, looks about right
[0 0 0 0] # det classes, shouldn't be zero.
@LukeAI You should add class_agnostic as a parameter in the forward and symbolic functions and modify the TRT_NMS.apply() arguments as needed.
def forward(
    ....
    ....
    class_agnostic=1)

def symbolic(
    ....
    ....
    class_agnostic=1)
I'm not sure, but I think you should pass it as class_agnostic_i in the g.op.
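Not verified in this thread, but a sketch of what threading that through could look like, assuming the installed TensorRT version's EfficientNMS_TRT plugin accepts the class_agnostic attribute listed in the parameters table linked above:
import torch


class TRT_NMS(torch.autograd.Function):
    '''Same dummy-forward / symbolic pair as above, with class_agnostic added.'''
    @staticmethod
    def forward(ctx, boxes, scores, background_class=-1, box_coding=0,
                iou_threshold=0.45, max_output_boxes=100, plugin_version="1",
                score_activation=0, score_threshold=0.25, class_agnostic=0):
        batch_size, num_boxes, num_classes = scores.shape
        num_det = torch.randint(0, max_output_boxes, (batch_size, 1), dtype=torch.int32)
        det_boxes = torch.randn(batch_size, max_output_boxes, 4)
        det_scores = torch.randn(batch_size, max_output_boxes)
        det_classes = torch.randint(0, num_classes, (batch_size, max_output_boxes), dtype=torch.int32)
        return num_det, det_boxes, det_scores, det_classes

    @staticmethod
    def symbolic(g, boxes, scores, background_class=-1, box_coding=0,
                 iou_threshold=0.45, max_output_boxes=100, plugin_version="1",
                 score_activation=0, score_threshold=0.25, class_agnostic=0):
        return g.op("TRT::EfficientNMS_TRT", boxes, scores,
                    background_class_i=background_class,
                    box_coding_i=box_coding,
                    iou_threshold_f=iou_threshold,
                    max_output_boxes_i=max_output_boxes,
                    plugin_version_s=plugin_version,
                    score_activation_i=score_activation,
                    score_threshold_f=score_threshold,
                    class_agnostic_i=class_agnostic,  # integer plugin attribute, hence the _i suffix
                    outputs=4)
TRT_NMS.apply(...) in ONNX_TRT.forward would then pass class_agnostic as the last positional argument.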
I've never had incorrect results, except when I trained it wrong in the first place. Have you validated your torch model output? What are your performance metrics?
hey thanks! Turned out to be an unrelated mistake in the forward function; I've updated the linked gist with the corrected code in case it helps somebody.
Hey @LukeAI , @haritsahm @BloodAxe @mmax3 @philipp-schmidt ,
I am struggling to write the inference code for the YOLO-NAS ONNX format.
Code:
input_name = session.get_inputs()[0].name
output_names = [x.name for x in session.get_outputs()]
ort_inputs = {input_name: im_np}
ort_outputs = session.run(output_names, ort_inputs)
Here ort_outputs is a list where ort_outputs[0] has shape (1, N, 4) and ort_outputs[1] has shape (1, N, 80) for the COCO dataset.
Can any of you please let me know how to write the post-processing function and handle the NMS,
so I can get boxes, scores and classes, similar to what we get from model.predict() in PyTorch?
Any help here will be deeply appreciated !!
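Not an answer given in the thread, but a rough CPU post-processing sketch for those two raw outputs, assuming (per the discussion above) the boxes are already xyxy in input-image pixels and torchvision is available for the NMS step; the postprocess helper name and thresholds are just placeholders:
import numpy as np
import torch
import torchvision  # only used for its batched NMS implementation


def postprocess(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    # boxes: (1, N, 4) xyxy, scores: (1, N, num_classes) from the ONNX Runtime outputs
    boxes, scores = boxes[0], scores[0]
    class_ids = scores.argmax(axis=1)   # best class per candidate box
    confidences = scores.max(axis=1)    # confidence of that class
    keep = confidences > conf_thres     # drop low-confidence candidates first
    boxes, confidences, class_ids = boxes[keep], confidences[keep], class_ids[keep]
    # class-aware NMS: boxes of different classes never suppress each other
    idx = torchvision.ops.batched_nms(torch.from_numpy(boxes).float(),
                                      torch.from_numpy(confidences).float(),
                                      torch.from_numpy(class_ids),
                                      iou_thres).numpy()
    return boxes[idx], confidences[idx], class_ids[idx]


final_boxes, final_scores, final_classes = postprocess(ort_outputs[0], ort_outputs[1])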