ONNX export to support TensorRT NMS plugin registration like yolov7
🚀 Feature Request
If I export YOLO-NAS as an ONNX file and inspect it with netron, I see the inputs/outputs look like this:
This means that, to use it with TensorRT, I would have to write my own NMS algorithm, probably running on the CPU, to get the final bounding boxes. However, TensorRT has an efficient GPU-accelerated NMS plugin https://github.com/NVIDIA/TensorRT/tree/main/plugin/efficientNMSPlugin that can be registered in an ONNX file.
Proposed Solution
Yolov7 does this, which makes overall end-to-end execution very fast, possibly faster than YOLO-NAS + CPU NMS at a given resolution / mAP: https://github.com/WongKinYiu/yolov7/blob/main/export.py
This is the code that does it: https://github.com/WongKinYiu/yolov7/blob/3b41c2cc709628a8c1966931e696b14c11d6db0c/utils/add_nms.py#L72
I would potentially be up for contributing this change if it would be welcome, with a little support?
I spent 2 days figuring out why the ONNX output formats are not similar when I switched from yolov7 to YOLO-NAS. Any "postprocessing" inside the ONNX that gives out reasonable bounding boxes with class and confidence would be really great.
What do you think @BloodAxe? Honestly, without the GPU-accelerated NMS, I doubt that YOLO-NAS is as good as yolov7 on the latency/accuracy curve.
+1 for this, the postprocessing is absolutely necessary for good throughput. It should be fairly easy to port this from the existing yolo variants.
Relevant code for this is here:
https://github.com/WongKinYiu/yolov7/blob/84932d70fb9e2932d0a70e4a1f02a1d6dd1dd6ca/models/experimental.py#L111
The classes ORT_NMS, TRT_NMS, ONNX_ORT, ONNX_TRT and End2End should be compatible with YOLO-NAS, I believe.
They also enable native ONNX NMS, if I'm not mistaken. So not only can the engine be exported to TensorRT, the native ONNX backends work with NMS as well.
You can attach the additional nn.Module to the ONNX exporter and it will be processed by ConvertableCompletePipelineModel.
Example code:
import os

from super_gradients.training import models
import torch
import torch.nn as nn


class PatchDeepStreamOutput(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        boxes, confscores = x
        # best class score and class index per box
        scores, classes = torch.max(confscores, 2, keepdim=True)
        return torch.cat((boxes, scores, classes), dim=2)


# `net` is the YOLO-NAS model and `trainer` the Trainer from the surrounding training code
deepstream_output_patch = PatchDeepStreamOutput()
deepstream_output_patch.eval()
models.convert_to_onnx(model=net, input_shape=(3, 640, 640), post_process=deepstream_output_patch,
                       out_path=os.path.join(trainer.checkpoints_dir_path, "best_ds.onnx"))
You can use it with the DeepStream plugins from marcoslucianops/DeepStream-Yolo.
If you want to add EfficientNMS_TRT, just add the End2End module to the postprocessing. Note that the YOLO-NAS bbox output format is BoxCorner [x1, y1, x2, y2], different from the other yolo variants' BoxCenter [cx, cy, w, h], so you need to adjust the box_coding parameter to 0.
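For illustration only, a minimal sketch of that adjustment, reusing the ONNX_TRT module from yolov7's models/experimental.py linked above (the module and attribute names come from that file, not from super-gradients):
# from models.experimental import ONNX_TRT  # copied from the yolov7 repo
# box_coding 1 = BoxCenter (cx, cy, w, h), 0 = BoxCorner (x1, y1, x2, y2)
nms_postprocess = ONNX_TRT(max_obj=100, iou_thres=0.45, score_thres=0.25, n_classes=80)
nms_postprocess.box_coding = 0  # YOLO-NAS already emits corner-format boxes
nms_postprocess.eval()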
@haritsahm
Thanks for your advice!
I've been trying to implement what you suggest but it's not quite working for me. Where am I going wrong?
I've tried quite a few variations, but this is what I'm working with at the moment:
#!/usr/bin/env python
from super_gradients.training import models
from super_gradients.common.object_names import Models
import torch
import torch.nn as nn


class TRT_NMS(torch.autograd.Function):
    '''TensorRT NMS operation'''
    @staticmethod
    def forward(
        ctx,
        boxes,
        scores,
        background_class=-1,
        box_coding=0,
        iou_threshold=0.45,
        max_output_boxes=100,
        plugin_version="1",
        score_activation=0,
        score_threshold=0.25,
    ):
        batch_size, num_boxes, num_classes = scores.shape
        num_det = torch.randint(0, max_output_boxes, (batch_size, 1), dtype=torch.int32)
        det_boxes = torch.randn(batch_size, max_output_boxes, 4)
        det_scores = torch.randn(batch_size, max_output_boxes)
        det_classes = torch.randint(0, num_classes, (batch_size, max_output_boxes), dtype=torch.int32)
        return num_det, det_boxes, det_scores, det_classes

    @staticmethod
    def symbolic(g,
                 boxes,
                 scores,
                 background_class=-1,
                 box_coding=0,
                 iou_threshold=0.45,
                 max_output_boxes=100,
                 plugin_version="1",
                 score_activation=0,
                 score_threshold=0.25):
        out = g.op("TRT::EfficientNMS_TRT",
                   boxes,
                   scores,
                   background_class_i=background_class,
                   box_coding_i=box_coding,
                   iou_threshold_f=iou_threshold,
                   max_output_boxes_i=max_output_boxes,
                   plugin_version_s=plugin_version,
                   score_activation_i=score_activation,
                   score_threshold_f=score_threshold,
                   outputs=4)
        nums, boxes, scores, classes = out
        return nums, boxes, scores, classes


class ONNX_TRT(nn.Module):
    '''onnx module with TensorRT NMS operation.'''
    def __init__(self, max_obj=100, iou_thres=0.45, score_thres=0.25, max_wh=None, device=None, n_classes=80):
        super().__init__()
        assert max_wh is None
        self.device = device if device else torch.device('cpu')
        self.background_class = -1,
        self.box_coding = 1,
        self.iou_threshold = iou_thres
        self.max_obj = max_obj
        self.plugin_version = '1'
        self.score_activation = 0
        self.score_threshold = score_thres
        self.n_classes = n_classes

    def forward(self, x):
        boxes, confscores = x
        scores, classes = torch.max(confscores, 2, keepdim=True)
        print("boxes.shape ", boxes.shape)
        print("confscores.shape ", confscores.shape)
        num_det, det_boxes, det_scores, det_classes = TRT_NMS.apply(boxes, scores, self.background_class, self.box_coding,
                                                                    self.iou_threshold, self.max_obj,
                                                                    self.plugin_version, self.score_activation,
                                                                    self.score_threshold)
        return num_det, det_boxes, det_scores, det_classes


net = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")
net.eval()
end2end = ONNX_TRT()
end2end.eval()
models.convert_to_onnx(model=net, input_shape=(3, 640, 640), post_process=end2end, out_path="yolo_nas_s.onnx")
but when I run, I get:
boxes.shape torch.Size([1, 8400, 4])
confscores.shape torch.Size([1, 8400, 80])
./export.py:42: FutureWarning: 'torch.onnx._patch_torch._graph_op' is deprecated in version 1.13 and will be removed in version 1.14. Please note 'g.op()' is to be removed from torch.Graph. Please open a GitHub issue if you need this functionality..
out = g.op("TRT::EfficientNMS_TRT",
/home/luke/.pyenv/versions/yoNAS/lib/python3.8/site-packages/torch/onnx/_patch_torch.py:81: UserWarning: The shape inference of TRT::EfficientNMS_TRT type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
_C._jit_pass_onnx_node_shape_type_inference(
/home/luke/.pyenv/versions/yoNAS/lib/python3.8/site-packages/torch/onnx/utils.py:687: UserWarning: The shape inference of TRT::EfficientNMS_TRT type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
_C._jit_pass_onnx_graph_shape_type_inference(
/home/luke/.pyenv/versions/yoNAS/lib/python3.8/site-packages/torch/onnx/utils.py:1178: UserWarning: The shape inference of TRT::EfficientNMS_TRT type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
_C._jit_pass_onnx_graph_shape_type_inference(
and a model with output like:
@LukeAI, does the error appear when exporting the model, or after export / when running inference with it?
when exporting - just when running that script^
I think it'll be fine, it's just a warning
ok, I can try. You can see in the netron screenshot that the output dimensions of the ONNX aren't defined, whereas when I export yolov7 they are. I'll also have to disable output dimension checks at inference time.
The reason they are defined for yolov7 is this manual step during export:
Basically, ONNX cannot know the output dimensions of a non-native plugin, so we have to specify the dimensions manually instead. The same way, you can specify the names of the outputs.
ah ok, that makes sense. How could we add that here?
- Change the method "convert_to_onnx" to make that adjustment before serializing to file or
- be lazy and just load the model from the onnx file again like yolov7 does it:
So something like this should do the trick (code really could be cleaner, but hey...):
# should be good to go if you have those options (opt.batch_size / opt.topk_all, as in yolov7's export.py)
import onnx

# output dims in graph order: num_dets (batch, 1), det_boxes (batch, topk, 4),
# det_scores (batch, topk), det_classes (batch, topk)
shapes = [opt.batch_size, 1, opt.batch_size, opt.topk_all, 4,
          opt.batch_size, opt.topk_all, opt.batch_size, opt.topk_all]
onnx_model = onnx.load("yolo_nas_s.onnx")  # load onnx model
onnx.checker.check_model(onnx_model)  # check onnx model
for i in onnx_model.graph.output:
    for j in i.type.tensor_type.shape.dim:
        j.dim_param = str(shapes.pop(0))
onnx.save(onnx_model, "yolo_nas_s_outdims.onnx")
Nice information. I previously used yolov6, which also has the same End2End approach to add EfficientNMS_TRT. But I've never done that, because despite the warnings the model works perfectly.
The model dimensions in the ONNX have no function. TensorRT is able to understand the output dimensions of its own plugin, so the TensorRT engine will be fine either way. But I would consider it good "documentation" so people understand the output better (e.g. when they open it in netron).
@philipp-schmidt do you have any idea how to set the names in a similar way?
I tried this:
batch_size = 1
topk_all = 100
shapes = [batch_size, 1,
          batch_size, topk_all, 4,
          batch_size, topk_all,
          batch_size, topk_all]
names = ["num_dets", "det_boxes", "det_scores", "det_classes"]
onnx_model = onnx.load(model_path)  # load onnx model
onnx.checker.check_model(onnx_model)  # check onnx model
for i in onnx_model.graph.output:
    i.name = names.pop(0)
    for j in i.type.tensor_type.shape.dim:
        j.dim_param = str(shapes.pop(0))
and it works insofar as when I inspect the onnx with netron I can see the output names are labelled as I expect:
but when I try to test with
trtexec --int8 --fp16 --avgRuns=10 --onnx=yolo_nas_s.onnx
I get a segmentation fault.
If I don't set the names and just leave the default, not particularly helpful names 915, 916, 917, 918 as below, trtexec successfully runs to completion as expected.
Please post the output of netron with your code applied.
have added to above post^
Inspecting the i object with dir() shows it has the following attributes:
['ByteSize', 'Clear', 'ClearExtension', 'ClearField', 'CopyFrom', 'DESCRIPTOR', 'DiscardUnknownFields', 'Extensions', 'FindInitializationErrors', 'FromString', 'HasExtension', 'HasField', 'IsInitialized', 'ListFields', 'MergeFrom', 'MergeFromString', 'ParseFromString', 'RegisterExtension', 'SerializePartialToString', 'SerializeToString', 'SetInParent', 'UnknownFields', 'WhichOneof', '_CheckCalledFromGeneratedFile', '_SetListener', '__class__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '_extensions_by_name', '_extensions_by_number', 'doc_string', 'name', 'type']
I just tried setting doc_string and that doesn't segfault and is an improvement over nothing:
but it would be nice if I could set the actual name as well, just to make my decoding/inference code a bit easier to read and so that it could seamlessly interoperate with different models.
In yolov7 they do this by directly passing input/output names to the torch.onnx.export function: https://github.com/WongKinYiu/yolov7/blob/84932d70fb9e2932d0a70e4a1f02a1d6dd1dd6ca/export.py#L159
It looks like super-gradients' convert_to_onnx DOES expose those arguments to us...
https://github.com/Deci-AI/super-gradients/blob/a43cfcd70072c7be1231f9183b1a717c136ff657/src/super_gradients/training/models/conversion.py#L169
ok I can confirm doing it with kwargs works:
model_path = "yolo_nas_s.onnx"
onnx_export_kwargs = {
    'input_names': ['images'],
    'output_names': ["num_dets", "det_boxes", "det_scores", "det_classes"]
}
models.convert_to_onnx(model=net, input_shape=(3, 640, 640), post_process=end2end, out_path=model_path,
                       torch_onnx_export_kwargs=onnx_export_kwargs)
Cool! No segfaults with that one?
can confirm no segfaults! no idea why it's different.
I think it's because you're force-renaming the nodes without telling the graph that the nodes have changed (I don't know how to explain it well, but I hope you get the idea). From your previous code sample:
for i in onnx_model.graph.output:
    i.name = names.pop(0)
    for j in i.type.tensor_type.shape.dim:
        j.dim_param = str(shapes.pop(0))
On the other hand, when we rename the outputs using the exporter, the graph is already configured to have nodes with those names via torch_onnx_export_kwargs, as shown in the torch onnx example.
If you want to rename a node in the onnx graph, you should create a new Variable/Node, attach it to the original onnx model outputs (the numbers 915, 916), and update the graph using onnx-graphsurgeon, similar to this example.
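Not from the thread, but a rough sketch of that onnx-graphsurgeon route; it assumes that renaming the output Variable objects (rather than only the graph.output entries in the protobuf, which is what segfaulted above) keeps the node outputs and graph outputs consistent:
import onnx
import onnx_graphsurgeon as gs  # https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon

graph = gs.import_onnx(onnx.load("yolo_nas_s.onnx"))
new_names = ["num_dets", "det_boxes", "det_scores", "det_classes"]
for tensor, new_name in zip(graph.outputs, new_names):
    # renaming the Variable object updates every reference to it, including the
    # producing EfficientNMS_TRT node, so the graph stays self-consistent
    tensor.name = new_name
onnx.save(gs.export_onnx(graph), "yolo_nas_s_renamed.onnx")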
I've written inference code, and after some experimenting everything is working (i.e. I get accurate-looking bounding boxes), except that the class values are always reported as 0 when they should be in the range 0-79. I think it might have something to do with the class_agnostic parameter, which we do not pass in the code above^ and which, per the docs, should be "set to true to do class-independent NMS; otherwise, boxes of different classes would be considered separately during NMS":
https://github.com/NVIDIA/TensorRT/tree/main/plugin/efficientNMSPlugin#parameters
But I can't work out how to pass that parameter correctly. If I pass class_agnostic_i=0 to g.op there is no behaviour change. If I pass any of class_agnostic_i=1, class_agnostic_i=-1, class_agnostic_f=1 or class_agnostic_f=-1, the ONNX exports but does not load with trt.OnnxParser without error.
I might be wrong on that front, but any help here would be greatly appreciated! Once it's working I can publish the solution publicly for others.
@haritsahm @philipp-schmidt
Export Code: https://gist.github.com/LukeAI/bbfc3ab749601ab0f2cb06e4b8fc75cb
Inference Code: https://gist.github.com/LukeAI/336a1fd9ea802d454d883342517a681f
When run with an image of 4 forks, I see the four forks correctly boxed, and this prints:
Total Running time = 0.0066 seconds
[0.888565 0.8684023 0.86700696 0.8414705 ] # det scores, looks about right
[0 0 0 0] # det classes, shouldn't be zero.
@LukeAI You should add class_agnostic as a parameter in the forward and symbolic functions and modify the TRT_NMS.apply() arguments as needed.
def forward(
    ....
    ....
    class_agnostic=1)

def symbolic(
    ....
    ....
    class_agnostic=1)
I'm not sure, but I think you should pass it as class_agnostic_i in the g.op.
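Not verified in this thread, but a sketch of what threading that through could look like, assuming the installed TensorRT version's EfficientNMS_TRT plugin accepts the class_agnostic attribute listed in the parameters table linked above:
import torch


class TRT_NMS(torch.autograd.Function):
    '''Same dummy-forward / symbolic pair as above, with class_agnostic added.'''
    @staticmethod
    def forward(ctx, boxes, scores, background_class=-1, box_coding=0,
                iou_threshold=0.45, max_output_boxes=100, plugin_version="1",
                score_activation=0, score_threshold=0.25, class_agnostic=0):
        batch_size, num_boxes, num_classes = scores.shape
        num_det = torch.randint(0, max_output_boxes, (batch_size, 1), dtype=torch.int32)
        det_boxes = torch.randn(batch_size, max_output_boxes, 4)
        det_scores = torch.randn(batch_size, max_output_boxes)
        det_classes = torch.randint(0, num_classes, (batch_size, max_output_boxes), dtype=torch.int32)
        return num_det, det_boxes, det_scores, det_classes

    @staticmethod
    def symbolic(g, boxes, scores, background_class=-1, box_coding=0,
                 iou_threshold=0.45, max_output_boxes=100, plugin_version="1",
                 score_activation=0, score_threshold=0.25, class_agnostic=0):
        return g.op("TRT::EfficientNMS_TRT", boxes, scores,
                    background_class_i=background_class,
                    box_coding_i=box_coding,
                    iou_threshold_f=iou_threshold,
                    max_output_boxes_i=max_output_boxes,
                    plugin_version_s=plugin_version,
                    score_activation_i=score_activation,
                    score_threshold_f=score_threshold,
                    class_agnostic_i=class_agnostic,  # integer plugin attribute, hence the _i suffix
                    outputs=4)
TRT_NMS.apply(...) in ONNX_TRT.forward would then pass class_agnostic as the last positional argument.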
I've never had incorrect results, except when I trained it wrong in the first place. Have you validated your torch model output? What are your performance metrics?
hey thanks! Turned out to be an unrelated mistake in the forward function; I've updated the linked gist with the corrected code in case it helps somebody.
Hey @LukeAI , @haritsahm @BloodAxe @mmax3 @philipp-schmidt ,
I am struggling to write the inference code for the YOLO-NAS ONNX format.
Code:
input_name = session.get_inputs()[0].name
output_names = [x.name for x in session.get_outputs()]
ort_inputs = {input_name: im_np}
ort_outputs = session.run(output_names, ort_inputs)
Here ort_outputs is a list where ort_outputs[0] has shape (1, N, 4) and ort_outputs[1] has shape (1, N, 80) for the COCO dataset.
Can any of you please let me know how to write the post-processing function and handle the NMS,
so I can get boxes, scores and classes, similar to what we get from model.predict() in PyTorch?
Any help here will be deeply appreciated !!
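Not an answer given in the thread, but a rough CPU post-processing sketch for those two raw outputs, assuming (per the discussion above) the boxes are already xyxy in input-image pixels and torchvision is available for the NMS step; the postprocess helper name and thresholds are just placeholders:
import numpy as np
import torch
import torchvision  # only used for its batched NMS implementation


def postprocess(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    # boxes: (1, N, 4) xyxy, scores: (1, N, num_classes) from the ONNX Runtime outputs
    boxes, scores = boxes[0], scores[0]
    class_ids = scores.argmax(axis=1)   # best class per candidate box
    confidences = scores.max(axis=1)    # confidence of that class
    keep = confidences > conf_thres     # drop low-confidence candidates first
    boxes, confidences, class_ids = boxes[keep], confidences[keep], class_ids[keep]
    # class-aware NMS: boxes of different classes never suppress each other
    idx = torchvision.ops.batched_nms(torch.from_numpy(boxes).float(),
                                      torch.from_numpy(confidences).float(),
                                      torch.from_numpy(class_ids),
                                      iou_thres).numpy()
    return boxes[idx], confidences[idx], class_ids[idx]


final_boxes, final_scores, final_classes = postprocess(ort_outputs[0], ort_outputs[1])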