ArcFace model query

Open hariharans29 opened this issue 5 years ago • 45 comments

Hello,

Was setting the spatial attribute to 0 in the BatchNormalization nodes of the ArcFace model intentional? A user notes that setting spatial=1 returns the right result as well. So I am trying to understand whether setting spatial = 0 (the non-default value) for the opset 8 model was an accident.

CC: @abhinavs95

Thanks!

hariharans29 avatar May 03 '19 01:05 hariharans29

@abhinavs95 Any update on this?

pranavsharma avatar May 20 '19 19:05 pranavsharma

Hi @hariharans29 @pranavsharma

The ArcFace model was prepared using MXNet and then converted to ONNX format using the MXNet to ONNX converter.

For BatchNorm, MXNet computes mean and variance per feature which is why we explicitly set spatial=0 when translating BatchNorm layers from MXNet to ONNX.

abhinavs95 avatar May 21 '19 19:05 abhinavs95

@abhinavs95 can this model be updated to use spatial=1? The ONNX standard has dropped support for spatial=0 from opset10 onwards and onnxruntime doesn't plan to support this.

pranavsharma avatar Jun 06 '19 19:06 pranavsharma

The spatial parameter is set to 0 in the MXNet to ONNX converter probably due to behavior of MXNet batchnorm: https://github.com/apache/incubator-mxnet/blob/745a41ca1a6d74a645911de8af46dece03db93ea/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py#L357

I'll try to see if this model can be converted with spatial=1.

abhinavs95 avatar Jun 07 '19 23:06 abhinavs95

@abhinavs95 did you get a chance to address this? Thanks!

pranavsharma avatar Jun 19 '19 07:06 pranavsharma

@abhinavs95 any update on this? Thanks!

pranavsharma avatar Jul 03 '19 01:07 pranavsharma

@pranavsharma changing the spatial parameter cannot be done using the mxnet to onnx converter API as I had hoped, it requires modification of the source code. I am currently busy focussing on another project, I will provide an update when I get a chance to work on this.

abhinavs95 avatar Jul 03 '19 18:07 abhinavs95

@abhinavs95 any update on this? onnxruntime does not (and possibly will not) support spatial==0 on its CPU provider, making tensorrt-inference-server unable to load exported models (see here).

arsdragonfly avatar Aug 23 '19 03:08 arsdragonfly

There are more models on the ONNX model zoo with this bug: Yolov3 and Duc are also non-usable by ONNX Runtime for the same reason. When will this be fixed?

mathisdon avatar Sep 19 '19 16:09 mathisdon

Yolov3 is not impacted by this and has been successfully tested as-is.

Duc and ArcFace models need to be updated to a newer ONNX version. Hopefully @abhinavs95 can make the necessary modifications soon.

prasanthpul avatar Sep 19 '19 21:09 prasanthpul

Just to reiterate on this: even on the GPU backend with ONNXRuntime (v0.4 or v0.5), the current model in the repository produces wrong results. The feature vectors returned from the final fc layer are always NaN. I strongly suggest retiring this model and maybe replacing it with a PyTorch version of the same thing until MXNet updates their ONNX exporter to the latest specification.

Mut1nyJD avatar Sep 23 '19 09:09 Mut1nyJD

@pranavsharma please tell me how to use yolov3 (keras-to-onnx). When I use it in tensorrt-inference-server, I get lots of NaN values.

17702513221 avatar Sep 29 '19 15:09 17702513221

@Mut1nyJD can you contribute a replacement model please?

prasanthpul avatar Sep 29 '19 21:09 prasanthpul

@prasanthpul

Working on it. I am currently training a new version from scratch using a PyTorch implementation (the model seems to export to ONNX in general) with the Ms1m dataset, but this is going to take a while since I have it on low priority.

Mut1nyJD avatar Sep 30 '19 08:09 Mut1nyJD

@Mut1nyJD Do you have a runnable ArcFace model?

luan1412167 avatar Oct 09 '19 07:10 luan1412167

@luan1412167

I am afraid I am still training. I have it at low priority, which is why it takes time. Hopefully soon. I will check whether an intermediate snapshot is exportable, but I don't see why not.

Mut1nyJD avatar Oct 10 '19 13:10 Mut1nyJD

@Mut1nyJD does ArcFace converted from PyTorch to ONNX give the right result?

luan1412167 avatar Oct 10 '19 13:10 luan1412167

Hi guys,

I think I finally cracked this issue. I supported non-spatial mode in ORT in this PR - https://github.com/microsoft/onnxruntime/pull/2092 but it still won't run the ArcFace model in the ONNX zoo.

This is because the ArcFace model is an invalid ONNX model: it violates the ONNX spec (https://github.com/onnx/onnx/blob/master/docs/Changelog.md#BatchNormalization-7). It has BatchNorm nodes with spatial == 0 but the input shapes don’t adhere to the required shape.

The spec says that the input shape should be ( C, D1, D2,…, Dn) for the inputs when spatial == 0:

image

But, in the model it has shape – [C]. This is only allowed for spatial == 1.

image

So, supporting non-spatial mode in ORT will not solve this problem. This is a bug in the MXNet exporter: it actually means spatial == 1 but still stamps the BatchNormalization node with spatial == 0. The output results are correct when we run the model assuming spatial == 1. So, the model doesn’t need re-conversion. It only needs an update in the model proto to make spatial == 1 in all the BN nodes, and it will run correctly in ORT.
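To make the shape argument concrete, here is a minimal NumPy sketch of inference-mode BatchNormalization. It is an illustration under stated assumptions, not ONNX Runtime's implementation: the function name `batchnorm_inference` is hypothetical, the input is assumed to be laid out as (N, C, D1, ..., Dn), and epsilon defaults to 1e-5. It accepts [C]-shaped parameters only for spatial == 1 and requires (C, D1, ..., Dn) for spatial == 0, following the shapes quoted from the spec above.

```python
import numpy as np


def batchnorm_inference(x, scale, bias, mean, var, spatial=1, eps=1e-5):
    """Inference-mode batch normalization for x of shape (N, C, D1, ..., Dn).

    Hypothetical helper for illustration only.
    spatial == 1: one statistic per channel, parameters have shape (C,).
    spatial == 0: one statistic per activation, parameters have shape (C, D1, ..., Dn).
    """
    if spatial == 1:
        expected = (x.shape[1],)
        # (C,) -> (1, C, 1, ..., 1) so it broadcasts over batch and spatial dims
        bshape = (1, x.shape[1]) + (1,) * (x.ndim - 2)
    elif spatial == 0:
        expected = tuple(x.shape[1:])
        # (C, D1, ..., Dn) -> (1, C, D1, ..., Dn) so it broadcasts over the batch only
        bshape = (1,) + tuple(x.shape[1:])
    else:
        raise ValueError("spatial must be 0 or 1")

    params = {"scale": scale, "bias": bias, "mean": mean, "var": var}
    for name, p in params.items():
        if p.shape != expected:
            raise ValueError(
                f"{name} has shape {p.shape}, expected {expected} for spatial == {spatial}"
            )

    scale, bias, mean, var = (p.reshape(bshape) for p in params.values())
    return scale * (x - mean) / np.sqrt(var + eps) + bias
```

With [C]-shaped parameters, as in the zoo model, only the spatial == 1 branch passes the shape check, which is consistent with the model producing correct results once the attribute is flipped.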

hariharans29 avatar Oct 11 '19 02:10 hariharans29

@hariharans29 you can try with this model here. It has spatial=0 and has been reshaped. I look forward to your result.

luan1412167 avatar Oct 11 '19 11:10 luan1412167

Hi @luan1412167 - I actually think it should be the opposite (spatial == 1).

hariharans29 avatar Oct 11 '19 18:10 hariharans29

Hi all,

I wrote a simple script to "correct" (not re-convert from base model) the ONNX model zoo ArcFace model from here - https://github.com/onnx/models/tree/master/vision/body_analysis/arcface.

This link contains the model (named resnet100.onnx) and test data. The script to correct the model is this (it is not possible to attach the corrected model as the size exceeds allowed limits)-

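import onnx

# Load the ArcFace model downloaded from the ONNX model zoo
model = onnx.load(r'arcface_mxnet\resnet100.onnx')

# Set spatial = 1 on every BatchNormalization node that carries the attribute
for node in model.graph.node:
    if node.op_type == "BatchNormalization":
        for attr in node.attribute:
            if attr.name == "spatial":
                attr.i = 1

# Save the corrected model under a new name
onnx.save(model, r'updated_resnet100.onnx')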

I checked the results in ONNXRuntime (using the test data provided in the same link) after correction and the result looks okay. Please use the corrected model if you have immediate inferencing needs.
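For anyone who wants to confirm the update took effect before running inference, here is a small sketch that counts BatchNormalization nodes by their spatial value. The helper name `count_spatial_values` is hypothetical. It only reads `graph.node`, `op_type`, `attribute`, `name` and `i`, so it should work on a model returned by `onnx.load`; the demo below uses plain stand-in objects instead of a real model so that it runs without the model file.

```python
from collections import Counter
from types import SimpleNamespace


def count_spatial_values(model):
    """Count BatchNormalization nodes by the value of their 'spatial' attribute.

    Nodes that do not set the attribute explicitly are counted under 'unset'.
    """
    counts = Counter()
    for node in model.graph.node:
        if node.op_type != "BatchNormalization":
            continue
        value = "unset"
        for attr in node.attribute:
            if attr.name == "spatial":
                value = attr.i
        counts[value] += 1
    return counts


def _fake_node(op_type, **int_attrs):
    # Stand-in for an ONNX NodeProto, for demonstration only
    attrs = [SimpleNamespace(name=k, i=v) for k, v in int_attrs.items()]
    return SimpleNamespace(op_type=op_type, attribute=attrs)


# Stand-in for a loaded model: two corrected BN nodes, one BN node without
# the attribute, and one unrelated Conv node
demo_model = SimpleNamespace(
    graph=SimpleNamespace(
        node=[
            _fake_node("BatchNormalization", spatial=1),
            _fake_node("Conv"),
            _fake_node("BatchNormalization", spatial=1),
            _fake_node("BatchNormalization"),
        ]
    )
)

demo_counts = count_spatial_values(demo_model)
print(demo_counts)
```

On the real file, the call would be `count_spatial_values(onnx.load(r'updated_resnet100.onnx'))`. After the correction, no nodes should be reported under the key 0.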

hariharans29 avatar Oct 11 '19 19:10 hariharans29

@abhinavs95 can you comment on @hariharans29's findings and how this can be fixed from MXNet side? If it cannot, then we should apply the correction to the downloadable model and eventually replace it with a model from another framework.

prasanthpul avatar Oct 11 '19 19:10 prasanthpul

@hariharans29 I used updated_resnet100.onnx following your instructions above. It runs, but the result seems to be wrong. Could you check the model's results again in Python and ONNX Runtime?

luan1412167 avatar Oct 14 '19 18:10 luan1412167

Did you use the official resnet100.onnx from the model zoo link or your converted model to make the update ?

I made the update on the official model and ran the test with all 3 test cases and the results are right.

As a double confirmation, another user made the same observation about setting spatial == 1 in the same model here - https://github.com/Microsoft/onnxruntime/issues/831.

Quoting him - "By now I figured out that the model works correctly if you change the "spatial" attribute of all BatchNormalization nodes from 0 to 1. However, I'm not really sure why that helps".

I just gave an explanation above as to why that helps.

hariharans29 avatar Oct 14 '19 18:10 hariharans29

I just downloaded the arcface model again from https://github.com/onnx/models/tree/master/vision/body_analysis/arcface, using the link called "248.9 MB" in the "Download" column, and ONNX Runtime still reports the same problem:

RuntimeError: [ONNXRuntimeError] : 1 : GENERAL ERROR : Exception during initialization: D:\3\s\onnxruntime\core/providers/cpu/nn/batch_norm.h:39 onnxruntime::BatchNorm::BatchNorm spatial == 1 was false. BatchNormalization kernel for CPU provider does not support non-spatial cases

mathisdon avatar Oct 14 '19 18:10 mathisdon

Hi @mathisdon ,

The model doesn't require spatial == 0. Can you please make the update to the model as suggested above and try running it ?

hariharans29 avatar Oct 14 '19 18:10 hariharans29

Hi @hariharans29, I downloaded the model from the model zoo and ran your script to change spatial from 0 to 1. The model link is here. I tried 2 different images but I get cosine distance = 0.96, so I think it is wrong (because 2 different images must give a small cosine distance between embeddings). Can you share the script you used to evaluate the model? images Tom_Hanks_54745

luan1412167 avatar Oct 15 '19 03:10 luan1412167

Hi,

I did not use a script. I used the onnx test runner tool in the OnnxRuntime repo. It can consume input tensor protobufs and output tensor protobufs and compare results after tests. I downloaded the 3 test cases from the ONNX model zoo link (download with test data) and used the onnx test runner tool to run each test case, and the output is correct.

What is the exact numerical cosine distance value you expect ? The definition of "wrong" results seems ridden with some hidden assumptions.

hariharans29 avatar Oct 15 '19 03:10 hariharans29

@hariharans29 can you check my model? here

I compute the cosine distance between two embeddings. If the two embeddings belong to the same person, the cosine distance will be near 1; otherwise the cosine distance will be small and near 0.

luan1412167 avatar Oct 15 '19 03:10 luan1412167

Hi,

I think it is the exact opposite. It is cosine "distance" (not similarity). When two people are different, the cosine distance will be near 1, and when they are the same, the value nears 0.
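The distinction can be sketched in a few lines of plain Python, assuming the common definition cosine distance = 1 - cosine similarity. The function names are illustrative and the vectors are toy values, not real ArcFace embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
print(same_direction, orthogonal)
```

Under this definition, a distance of 0.96 between embeddings of two different people is in line with the model behaving correctly, whereas a similarity of 0.96 would not be.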

hariharans29 avatar Oct 15 '19 03:10 hariharans29