
Standardize ONNX output format

Open thhart opened this issue 2 years ago • 3 comments

Search before asking

  • [X] I have searched the YOLOv8 issues and found no similar feature requests.

Description

Unfortunately, the ONNX output layer differs from YOLOv5 and other versions. It is a small difference, but it leads to confusion and requires adapting any code that consumes the models. I am not an expert, but there is apparently no standard for this, which is a pity. Documentation about the input/output layers is also rare and not widely known among YOLO practitioners. At the very least it would be helpful to have some documentation here about the output format, since inference is often done in other frameworks.
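In the meantime, the most reliable way I have found to see what a given export actually produces is to ask the runtime directly. A minimal sketch with onnxruntime (the file name is just a placeholder for your own export):

```python
import onnxruntime as ort

# Print the declared input/output names, shapes and types of an exported model.
session = ort.InferenceSession("best.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)

# A YOLOv8 detection export typically reports a single output of shape
# [batch, 4 + num_classes, num_predictions]; a -seg export adds a second
# output holding the mask prototypes.
```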

Here are some sample issues with obvious confusion:

  • https://github.com/ultralytics/ultralytics/issues/300
  • https://github.com/ultralytics/ultralytics/issues/221

Use case

It would help spread the use of the models and avoid adaptation work.

Additional

No response

Are you willing to submit a PR?

  • [ ] Yes I'd like to help by submitting a PR!

thhart avatar Jan 14 '23 10:01 thhart

Totally agree. I am currently trying to get instance segmentation to work, but I can't find any documentation about the output format. Guessing and staring at float arrays shouldn't be the solution.
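For reference, this is how I currently understand the segmentation export to be laid out, pieced together from reading the Python post-processing code. Treat the exact channel layout as an assumption rather than documented fact; a rough NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_segmentation(output0, output1, num_classes, conf_thres=0.25):
    # Assumed layouts:
    #   output0: (1, 4 + num_classes + 32, N) -> boxes, class scores, 32 mask coefficients
    #   output1: (1, 32, H/4, W/4)            -> prototype masks
    preds = output0[0].T                           # (N, 4 + num_classes + 32)
    boxes = preds[:, :4]                           # cx, cy, w, h
    scores = preds[:, 4:4 + num_classes]
    mask_coeffs = preds[:, 4 + num_classes:]       # (N, 32)

    keep = scores.max(axis=1) > conf_thres         # NMS omitted to keep the sketch short
    protos = output1[0].reshape(32, -1)            # (32, H/4 * W/4)
    masks = sigmoid(mask_coeffs[keep] @ protos)    # one low-res mask per kept detection
    masks = masks.reshape(-1, *output1.shape[2:])  # (kept, H/4, W/4); crop to the box and upsample afterwards
    return boxes[keep], scores[keep], masks > 0.5
```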

DavidBerschauer avatar Jan 17 '23 14:01 DavidBerschauer

ABSOLUTELY AGREE. Even if the formats between the two cannot be standardized, it would be immensely helpful to have a detailed explanation of what the output format means and how to parse it. I am a .NET developer, not a Python developer, so my ability to read the Python scripts and work out what they are doing is limited. It has been difficult for me to implement these models in a C#/Windows app because there is no good documentation.

sstainba avatar Jan 17 '23 18:01 sstainba

I've closed #457 with the code (Python) I've used that works with both the YOLOv5 and YOLOv8 ONNX exports. Hopefully this is helpful. Look at the two `if model == "yolov8":` branches to see what needs to change when reading the ONNX output for object detection. I've not done segmentation, so I cannot say whether it will work for that as well; if not, hopefully it gets you closer to a solution.
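For anyone who doesn't want to dig through the linked code, the gist of the difference those branches handle is roughly this (an illustrative sketch, not the exact code from #457; the shapes assume a 640x640 input and an 80-class model):

```python
import numpy as np

def to_common_format(pred: np.ndarray, model: str) -> np.ndarray:
    """Normalize a raw ONNX detection output to (N, 4 + num_classes) rows of
    [cx, cy, w, h, class scores]."""
    pred = pred[0]  # drop the batch dimension
    if model == "yolov8":
        # YOLOv8: (4 + num_classes, N) with no objectness score -> just transpose
        return pred.T                                      # e.g. (84, 8400) -> (8400, 84)
    # YOLOv5: (N, 5 + num_classes) with an objectness score at index 4
    boxes, obj, cls = pred[:, :4], pred[:, 4:5], pred[:, 5:]
    return np.concatenate([boxes, obj * cls], axis=1)      # fold objectness into the class scores
```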

knoppmyth avatar Jan 19 '23 23:01 knoppmyth

The output coming out of my custom-trained YOLOv8 model is something like this: outputs.shape: (1, 9, 1029)

I'm not sure what to make of it: it's an array of arrays, there's no confidence score or anything, and it's for one object. How can I determine which value is which, @knoppmyth?

umanniyaz avatar Jan 25 '23 12:01 umanniyaz

The output coming out of my custom-trained YOLOv8 model is something like this: outputs.shape: (1, 9, 1029)

I'm not sure what to make of it: it's an array of arrays, there's no confidence score or anything, and it's for one object. How can I determine which value is which, @knoppmyth?

The output layout is [batch, 4 box values + one score per class, number of candidate predictions]. So it looks like your model has 5 classes: the second dimension (9) is the 4 box values (cx, cy, w, h) plus one confidence score per class, and the last dimension (1029) is the number of candidate boxes the model proposes for your input size.
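As a sanity check, 1029 is exactly what you would get for a 224x224 input if the model uses the usual three detection heads at strides 8, 16 and 32 (an assumption on my part):

```python
# Candidate predictions for a 224x224 input, assuming detection heads at strides 8, 16 and 32.
strides = [8, 16, 32]
num_predictions = sum((224 // s) ** 2 for s in strides)
print(num_predictions)  # 784 + 196 + 49 = 1029
```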

sstainba avatar Jan 25 '23 13:01 sstainba

I trained for 5 classes, and the input tensor sent to the ORT session was of shape (1, 3, 224, 224); the output I got was a multidimensional array of shape (1, 9, 1029). If that is correct, please share a code snippet showing how I should extract the classes, confidence scores, and everything else from this multi-dimensional array:

Example:

```
array([[[6.8321829e+00, 1.2145002e+01, 1.6367720e+01, ..., 1.4589365e+02, 1.7720135e+02, 2.0850992e+02],
        [1.6865181e+01, 1.6190460e+01, 1.3565935e+01, ..., 2.1476869e+02, 2.1531071e+02, 2.1583066e+02],
        [1.3708321e+01, 2.3482452e+01, 3.0406818e+01, ..., 2.6386530e+02, 2.5891931e+02, 2.5001068e+02],
        ...,
        [5.6624413e-07, 2.3841858e-07, 8.9406967e-08, ..., 6.1094761e-06, 6.8545341e-06, 7.1823597e-06],
        [4.7683716e-07, 1.1920929e-07, 1.1920929e-07, ..., 6.0200691e-06, 6.7353249e-06, 7.0333481e-06],
        [7.4505806e-07, 1.1920929e-07, 5.9604645e-08, ..., 6.4671040e-06, 7.2121620e-06, 7.5399876e-06]]], dtype=float32)
```

@sstainba

umanniyaz avatar Jan 25 '23 20:01 umanniyaz

I only have one class, and it's an array of arrays. It's confusing which value is which.

What do you mean by "it's an array of arrays"? You mean the output? Then yes, that's correct. It's a multi-dimensional array, i.e. a tensor. But if you only have 1 class, then the second dimension should be 5. I have two custom models, both with a single class, and both have an output shape of [1, 5, xxxxx].

sstainba avatar Jan 25 '23 20:01 sstainba

@sstainba Can you respond to the updated comment?

umanniyaz avatar Jan 27 '23 07:01 umanniyaz

I totally agree with you guys. Every time there's something new and exciting to test out, it takes an absolute journey to pull together the pieces and fragments of information about what is actually supposed to happen and how, and then how to wrap it in C++. It's just painful...

JustasBart avatar Jan 30 '23 13:01 JustasBart

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

  • Docs: https://docs.ultralytics.com
  • HUB: https://hub.ultralytics.com
  • Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions[bot] avatar Mar 22 '23 00:03 github-actions[bot]

@JustasBart I understand the frustration. Dealing with different output formats and a lack of documentation can be quite challenging, especially when working across languages like Python and C++. While I don't have exact C++ code for parsing the output, I can share a general idea of the structure of the output array.

First off, the output shape (1, 9, 1029) indicates a batch size of 1, 9 channels per prediction, and 1029 candidate predictions. Since you trained your model for 5 classes, those 9 channels are the 4 bounding box values (cx, cy, w, h) followed by the 5 class confidence scores; unlike YOLOv5, there is no separate objectness score.

To parse this output, you transpose it to (1029, 9), split each row into the 4 box values and the 5 class scores, take the highest-scoring class as the prediction for that candidate, discard candidates below a confidence threshold, and run non-maximum suppression on what remains before drawing the boxes.
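As an example, here is a minimal NumPy sketch of those steps. The cx/cy/w/h box format and the thresholds are assumptions based on the default detection export, so adjust them as needed:

```python
import numpy as np

def decode(output: np.ndarray, conf_thres: float = 0.25, iou_thres: float = 0.45):
    """Decode a (1, 4 + num_classes, num_predictions) detection output, e.g. (1, 9, 1029)."""
    preds = output[0].T                          # (1029, 9): one row per candidate box
    boxes_cxcywh, scores = preds[:, :4], preds[:, 4:]
    class_ids = scores.argmax(axis=1)
    confidences = scores.max(axis=1)

    keep = confidences > conf_thres              # drop low-confidence candidates
    boxes_cxcywh, class_ids, confidences = boxes_cxcywh[keep], class_ids[keep], confidences[keep]

    cx, cy, w, h = boxes_cxcywh.T                # cx, cy, w, h -> x1, y1, x2, y2
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

    # Plain class-agnostic non-maximum suppression
    order = confidences.argsort()[::-1]
    selected = []
    while order.size:
        i = order[0]
        selected.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter + 1e-7)
        order = order[1:][iou < iou_thres]
    return boxes[selected], class_ids[selected], confidences[selected]
```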

I hope this provides an initial insight into how to interpret the model's output. If you need further assistance or a more detailed explanation, feel free to ask!

pderrenger avatar Nov 16 '23 02:11 pderrenger

Hi @pderrenger and @umanniyaz,

I trained a best.pt model with 5 segmentation classes using YOLOv8. Then I converted best.pt to best.onnx format with Python:

```python
from ultralytics import YOLO

# Load a model
model_path = r'C:\Users\user\Desktop\Dogukan\best.pt'
official_model_path = r'C:\Users\user\Desktop\Dogukan\yolov8m-seg.pt'

model = YOLO(official_model_path)  # load an official model
model = YOLO(model_path)           # load a custom trained model

# Export the model
model.export(format='onnx')
```

Then I ran the best.onnx model on a PNG file in Android (Java) and collected the results in the "result" object. My code is below:

```java
public void onn() throws OrtException {
    OrtEnvironment env = OrtEnvironment.getEnvironment();
    OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
    OrtSession session = env.createSession("/storage/emulated/0/Android/data/com.bantas.debug/files/best.onnx", opts);
    try {
        Bitmap bitmap = BitmapFactory.decodeFile("/storage/emulated/0/Android/data/com.bantas.debug/files/BlickerImage_37712.png");
        float[][][][] input = preprocessImage(bitmap);

        OnnxTensor inputTensor = OnnxTensor.createTensor(env, input);
        OrtSession.Result result = session.run(Collections.singletonMap("images", inputTensor));
        OnnxTensor outputTensor = (OnnxTensor) result.get(0);

        float[][][] threeDimensionalArray = (float[][][]) result.get(0).getValue();

        float[][] scores = threeDimensionalArray[0];
        // Map<String, Float> scoreMap = (Map<String, Float>) result.get(1).getValue();

        Gson gson = new Gson();
        String scoresJson = gson.toJson(scores);
        // String scoreMapJson = gson.toJson(scoreMap);

        // String deneme = processResult1(result);
        Object out = outputTensor.getValue();

        // Gson gson = new Gson();
        String json = gson.toJson(out);
        String a = json;

    } catch (Exception e) {
        e.printStackTrace();
    }
}
```

With the model output above, the json object comes out as follows:

json=[[[4.510956,17.391268,22.62502,27.104994 ... ...,0.6338775,0.54058886,0.36109298,0.062129557,0.043105423,0.3687974]]]

Using the JSON output from the ONNX model, what steps do I need to follow in Android to draw the bounding boxes on the image, along with their class and confidence scores? How do I parse the JSON data representing the model output to get the box coordinates, class information and confidence scores?

mrdoguoz avatar Nov 16 '23 18:11 mrdoguoz