ArcFace model query
Hello,
Was setting the spatial attribute to 0 in the BatchNormalization nodes of the ArcFace model intentional? A user notes that setting spatial=1 returns the right result as well, so I am trying to understand whether setting spatial=0 (the non-default value) for the opset 8 model was an accident.
CC: @abhinavs95
Thanks!
@abhinavs95 Any update on this?
Hi @hariharans29 @pranavsharma
The ArcFace model was prepared using MXNet and then converted to ONNX format using the MXNet to ONNX converter.
For BatchNorm, MXNet computes the mean and variance per feature, which is why we explicitly set spatial=0 when translating BatchNorm layers from MXNet to ONNX.
@abhinavs95 can this model be updated to use spatial=1? The ONNX standard has dropped support for spatial=0 from opset10 onwards and onnxruntime doesn't plan to support this.
The spatial parameter is set to 0 in the MXNet to ONNX converter probably due to behavior of MXNet batchnorm: https://github.com/apache/incubator-mxnet/blob/745a41ca1a6d74a645911de8af46dece03db93ea/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py#L357
I'll try to see if this model can be converted with spatial=1.
@abhinavs95 did you get a chance to address this? Thanks!
@abhinavs95 any update on this? Thanks!
@pranavsharma changing the spatial parameter cannot be done through the MXNet to ONNX converter API as I had hoped; it requires modifying the source code. I am currently busy with another project and will provide an update when I get a chance to work on this.
@abhinavs95 any update on this? onnxruntime does not (and possibly will not) support spatial==0 on its CPU provider, making tensorrt-inference-server unable to load exported models (see here).
There are more models on the ONNX model zoo with this bug: Yolov3 and Duc are also non-usable by ONNX Runtime for the same reason. When will this be fixed?
Yolov3 is not impacted by this and has been successfully tested as-is.
Duc and ArcFace models need to be updated to a newer ONNX version. Hopefully @abhinavs95 can make the necessary modifications soon.
Just to reiterate: even on the GPU backend with ONNXRuntime (v0.4 or v0.5), the current model in the repository produces wrong results. The feature vectors returned from the final fc layer are always NaN. I strongly suggest retiring this model and perhaps replacing it with a PyTorch version of the same thing until MXNet updates its ONNX exporter to the latest specification.
@pranavsharma please tell me how to use yolov3 (keras-to-onnx). When I use it in tensorrt-inference-server, I get a lot of NaNs.
@Mut1nyJD can you contribute a replacement model please?
@prasanthpul
I am working on it. I am currently training a new version from scratch on the Ms1m dataset using a PyTorch implementation (the model seems to export to ONNX in general), but this is going to take a while since I have it on low priority.
@Mut1nyJD Do you have a runnable ArcFace model?
@luan1412167
I am afraid I am still training. I have it at low priority, which is why it takes time. Hopefully soon. I will check whether an intermediate snapshot is exportable, but I don't see why not.
@Mut1nyJD does the ArcFace model converted from PyTorch to ONNX give the right results?
Hi guys,
I think I finally cracked this issue. I supported non-spatial mode in ORT in this PR - https://github.com/microsoft/onnxruntime/pull/2092 but it still won't run the ArcFace model in the ONNX zoo.
This is because the ArcFace model is an invalid ONNX model: it violates the ONNX spec (https://github.com/onnx/onnx/blob/master/docs/Changelog.md#BatchNormalization-7). It has BatchNorm nodes with spatial == 0, but the input shapes don’t adhere to the required shape.
The spec says that the scale, B, mean and var inputs should have shape (C, D1, D2, …, Dn) when spatial == 0.
But in the model they have shape [C], which is only allowed for spatial == 1.
So, supporting non-spatial mode in ORT will not solve this problem. This is a bug in the MXNet exporter: it actually means spatial == 1 but still stamps the BatchNormalization node with spatial == 0. The output results are correct when we run the model assuming spatial == 1. So, the model doesn’t need re-conversion; it only needs an update in the model proto to set spatial == 1 in all the BN nodes, and it will then run correctly in ORT.
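To make the shape argument above concrete, here is a minimal NumPy sketch of inference-mode batch normalization under the two settings. It is an illustration only, not ORT's implementation: the function name `batchnorm_inference` and the toy shapes are made up for this example, and the shape rules are taken from the BatchNormalization-7 description quoted above.

```python
import numpy as np

def batchnorm_inference(x, scale, bias, mean, var, spatial=1, eps=1e-5):
    """Inference-mode batch norm for x of shape (N, C, D1, ..., Dn).

    spatial=1: scale/bias/mean/var have shape (C,), one value per channel.
    spatial=0: they have shape (C, D1, ..., Dn), one value per activation.
    """
    expected = (x.shape[1],) if spatial == 1 else tuple(x.shape[1:])
    for name, p in (("scale", scale), ("bias", bias), ("mean", mean), ("var", var)):
        if tuple(p.shape) != expected:
            raise ValueError(
                f"{name} has shape {tuple(p.shape)}, expected {expected} for spatial={spatial}"
            )
    if spatial == 1:
        # Reshape (C,) to (1, C, 1, ..., 1) so it broadcasts over batch and spatial dims.
        view = (1, -1) + (1,) * (x.ndim - 2)
        scale, bias, mean, var = (p.reshape(view) for p in (scale, bias, mean, var))
    # For spatial=0 the (C, D1, ..., Dn) parameters broadcast over the batch dim as-is.
    return scale * (x - mean) / np.sqrt(var + eps) + bias

# Toy input: batch of 2, 3 channels, 4x4 feature maps.
x = np.random.default_rng(0).normal(size=(2, 3, 4, 4)).astype(np.float32)
# Per-channel parameters of shape [C], as found in the ArcFace model.
scale = np.ones(3, dtype=np.float32)
bias = np.zeros(3, dtype=np.float32)
mean = np.zeros(3, dtype=np.float32)
var = np.ones(3, dtype=np.float32)

y = batchnorm_inference(x, scale, bias, mean, var, spatial=1)  # shapes are consistent

try:
    batchnorm_inference(x, scale, bias, mean, var, spatial=0)  # [C] is not a valid shape here
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

With parameters of shape [C], only the spatial == 1 branch is well-defined for a 4-D input, which matches the observation that the model runs correctly once the attribute is changed.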
@hariharans29 you can try with this model here. this has spatial=0 and reshaped. I look forward to the result from you.
Hi @luan1412167 - I actually think it should be the opposite (spatial == 1).
Hi all,
I wrote a simple script to "correct" (not re-convert from base model) the ONNX model zoo ArcFace model from here - https://github.com/onnx/models/tree/master/vision/body_analysis/arcface.
This link contains the model (named resnet100.onnx) and test data. The script to correct the model is this (it is not possible to attach the corrected model as the size exceeds allowed limits)-
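import onnx

# Load the ArcFace model downloaded from the ONNX model zoo.
model = onnx.load(r'arcface_mxnet\resnet100.onnx')

# Set spatial = 1 on every BatchNormalization node that carries the attribute.
for node in model.graph.node:
    if node.op_type == "BatchNormalization":
        for attr in node.attribute:
            if attr.name == "spatial":
                attr.i = 1

# Save the corrected model under a new name.
onnx.save(model, r'updated_resnet100.onnx')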
I checked the results in ONNXRuntime (using the test data provided in the same link) after correction and the result looks okay. Please use the corrected model if you have immediate inferencing needs.
@abhinavs95 can you comment on @hariharans29's findings and how this can be fixed from MXNet side? If it cannot, then we should apply the correction to the downloadable model and eventually replace it with a model from another framework.
@hariharans29 I used updated_resnet100.onnx as per your instructions above. It runs, but the results seem to be wrong. Could you check again the results of running the model in Python and ONNX Runtime?
Did you use the official resnet100.onnx from the model zoo link or your converted model to make the update ?
I made the update on the official model and ran the test with all 3 test cases and the results are right.
As a double confirmation, another user observed that setting spatial == 1 in the same model works; see https://github.com/Microsoft/onnxruntime/issues/831.
Quoting him - "By now I figured out that the model works correctly if you change the "spatial" attribute of all BatchNormalization nodes from 0 to 1. However, I'm not really sure why that helps".
I just gave an explanation above as to why that helps.
I just downloaded the arcface model again from https://github.com/onnx/models/tree/master/vision/body_analysis/arcface, using the link called "248.9 MB" in the "Download" column, and ONNX Runtime still reports the same problem:
RuntimeError: [ONNXRuntimeError] : 1 : GENERAL ERROR : Exception during initialization: D:\3\s\onnxruntime\core/providers/cpu/nn/batch_norm.h:39 onnxruntime::BatchNorm
Hi @mathisdon ,
The model doesn't require spatial == 0. Can you please make the update to the model as suggested above and try running it ?
Hi @hariharans29,
I have downloaded the model from the model zoo and run your script to change spatial from 0 to 1.
The model link is here.
I tried with 2 different images, but I get cosine distance = 0.96, so I think it is wrong (because the cosine distance between the embeddings of 2 different images must be small). Can you share a script to evaluate the model?
Hi,
I did not use a script. I used the onnx test runner tool in the OnnxRuntime repo. It can consume input tensor protobufs and output tensor protobufs and compare the results after running the tests. I downloaded the 3 test cases from the ONNX model zoo link (download with test data), used the onnx test runner tool to run each test case, and the output is correct.
What is the exact numerical cosine distance value you expect ? The definition of "wrong" results seems ridden with some hidden assumptions.
@hariharans29 can you check my model? here
I compute the cosine distance between two embeddings. If the two embeddings belong to the same person, the cosine distance will be near 1; otherwise the cosine distance will be small and near 0.
Hi,
I think it is the exact opposite. It is cosine "distance" (not similarity). When two people are different, the cosine distance will be near 1, and when they are the same, the value will be near 0.
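The difference between the two quantities can be sketched in a few lines. This assumes the common convention that cosine distance is 1 minus cosine similarity; the function names and the toy vectors are illustrative and are not taken from any evaluation script in this thread.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 for parallel, 0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity, so 0 means same direction."""
    return 1.0 - cosine_similarity(a, b)

# Embeddings pointing in the same direction (stand-in for the same person).
same = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal embeddings (stand-in for two unrelated people).
different = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same, different)
```

Under this convention, a value of 0.96 for two images of different people would be the expected outcome if it is a distance, and a sign of a problem only if it is a similarity.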