nncf
Add accuracy acceptance criteria to onnx e2e test
Changes
- Generate HTML report for ONNX e2e test
- Add model accuracy degradation acceptance criteria for ONNX e2e test
Reason for changes
To keep alignment with other e2e tests.
Related tickets
88625
Tests
retest this please
Please refer to this build NNCF/job/nightly/job/e2e_onnx_model_zoo/22/ about this change.
Thanks for the link. Could you make the report in the same style as all e2e builds?
Please clarify your request. Which do you mean: the table schema, the cell color formatting, or the text font?
I mean something like this: https://github.com/openvinotoolkit/nncf/blob/cf8784f3811c40c3aa99ba63593638589472542b/tests/tensorflow/test_sota_checkpoints.py#L200
@vinnamkim , do you plan to finalize this PR?
@alexsu52 @MaximProshin
Please clarify your requirements for merging this PR. I know that there are existing reports for PyTorch and TensorFlow, e.g. nncf/tests/tensorflow/test_sota_checkpoints.py, but they have a fundamentally different table schema compared to the ONNX report. In which respect do you want me to make the ONNX report coherent with them? Table formatting (font style, color, etc.)? Or you could just merge this PR and make an improvement in the future.
Hi @vinnamkim!
We would like to have the same table style as torch or tf: if the model passes (accuracies are as expected), the row is green; if not, it is yellow; and it is red if an error occurred during quantization or the difference between INT8 and FP32 exceeds the threshold. Such a table is easy to understand. If you need any help, please contact me or @vshampor directly. Thanks!
Model | Metrics type | Expected FP32 | FP32 | INT8 | Diff | Diff Expected
---|---|---|---|---|---|---
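For illustration, a minimal sketch of how that green/yellow/red coloring could be produced with a pandas `Styler`; the column names follow the header above, while the thresholds and example row are assumptions, not the actual test code.

```python
import pandas as pd

DIFF_FP32_MAX = 1.0  # assumed maximum allowed INT8-vs-FP32 drop, in percentage points

def row_color(row: pd.Series) -> list:
    """Color a row green (pass), yellow (accuracy off the reference), or red (error / large drop)."""
    if pd.isna(row["INT8"]) or row["Diff"] < -DIFF_FP32_MAX:
        color = "background-color: red"
    elif row["Diff Expected"] < -DIFF_FP32_MAX:
        color = "background-color: yellow"
    else:
        color = "background-color: green"
    return [color] * len(row)

df = pd.DataFrame(
    [["resnet50", "accuracy@top1", 76.1, 76.1, 75.9, -0.2, -0.2]],
    columns=["Model", "Metrics type", "Expected FP32", "FP32", "INT8", "Diff", "Diff Expected"],
)
html_report = df.style.apply(row_color, axis=1).to_html()  # renders the colored HTML table
```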
@vinnamkim I tend to agree with the rest of the team on the formatting and creation of the accuracy results table. The ONNX version should be consistent with the PT/TF versions - both in the configuration file format (i.e. JSON would be preferable to CSV - follow the format in PT, for example) and the process of building the resulting HTML. If you can rewrite the TF/PT HTML building process using pandas so that it is more concise and readable, please do so in this PR or in a separate one (in the latter case we would have to test and merge the separate PR prior to this one, though).
> Or, you can just merge this PR and make an improvement in the future.
Surely you would agree that following these kinds of arguments would not make for a sustainable, continuously improving state of the repository.
Hi @kshpv, @vshampor, I updated the table schema and style to be the same as you described.
It looks okay to me. @vshampor, please take a look.
@vshampor @vinnamkim could you please contact each other and finalize this PR?
@vinnamkim it's pretty hard to review this if you won't at least post a screenshot of what the table actually looks like when rendered.
You also haven't addressed my other concerns:
- See `tests.torch.test_sota_checkpoints.TestSotaCheckpoints.write_results_table` for the actual code that does HTML table rendering for the PT accuracy test, and `tests.tensorflow.test_sota_checkpoints.RunTest.write_results_table` for TF, respectively. The way the table is built there is different from what you are proposing. Please reuse `yattag`'s `Doc` class for markup-as-code in your solution, and `PrettyTable` for building and rendering the table (see the sketch after this list). If you think that using a pandas dataframe is easier and more beautiful than using a `PrettyTable`, then please modify the PT and TF table-building code so that they also use pandas. Reuse code and extract common (across frameworks) parts of the HTML rendering code to `tests/common` where possible.
- What purpose do the `accuracy_checker-reference.csv` files serve? If these are for storing accuracy reference values, then this is not the way we do this. We instead store only the values that are actually necessary for us in a structured manner in `tests/tensorflow/sota_checkpoints_eval.json` and `tests/torch/sota_checkpoints_eval.json`. Make yourself acquainted with the structure of these files and set up the ONNX-specific JSON file with the same structure for the models in your PR.
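For reference, a rough sketch of how `yattag`'s `Doc` and `PrettyTable` can be combined to render an HTML report; this is not the actual `write_results_table` code, and the column set, example row, and output file name are assumptions.

```python
from prettytable import PrettyTable
from yattag import Doc

# Build the results table itself with PrettyTable.
table = PrettyTable()
table.field_names = ["Model", "Metrics type", "Expected", "FP32", "INT8", "Diff FP32", "Diff Expected"]
table.add_row(["resnet50", "accuracy@top1", 76.1, 76.1, 75.9, -0.2, -0.2])  # example row

# Wrap the table into a full HTML page using yattag's markup-as-code Doc class.
doc, tag, text = Doc().tagtext()
with tag("html"):
    with tag("head"), tag("title"):
        text("ONNX E2E results")
    with tag("body"):
        doc.asis(table.get_html_string(attributes={"border": "1"}))

with open("results.html", "w", encoding="utf-8") as f:  # assumed output location
    f.write(doc.getvalue())
```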
Firstly, you can see ONNX E2E model zoo build 36. Secondly, let me share my concerns about your claims:
- I'm not familiar with using libraries like `prettytable` and `yattag`, and I don't have many resources to change this PR to use them. In addition, it's hard to see why this PR should use those libraries, since PT and TF each have their own implementation and there is no common interface code yet. If we proceed to make a common interface for generating the report across all frameworks, I don't think this PR is the appropriate place for that work, because I believe a PR should be as small as possible.
- I agree.
Thus, my suggestion is fixing Claim 2 and concluding this PR. What do you think?
@vinnamkim let's have a quick comparison with the expected format for the rest of the E2E (on the TF example, ignore the "tfds" section for simplicity):
and yours:

- Only the cells with actual numbers are supposed to be colored. Yours are colored throughout, including the header cells.
- The E2E result tables have separate sections specifying the results obtained while inferring in the original framework ("tfrecords" in our TF E2E table) and while inferring via OpenVINO ("OpenVino" in our TF E2E table). Can you please tell whether the results in your table are obtained using `onnxruntime` inference or OpenVINO Inference Engine inference? If the former, then this actually shouldn't be called an E2E test, since an E2E test involves running the model in OpenVINO and assessing its accuracy in the end. If the latter, then at least put the current results into the "OpenVINO" section as it is done in the TF E2E table.
- The "Expected FP32" column shouldn't really be filled entirely with "None", don't you agree? There would be no reason for the entire column otherwise. For the TF and PT E2E we have an "Expected" column, which you should keep instead, and which reflects the expected, reference accuracy metric for the associated model against which the measured accuracy is compared with respect to the threshold.
- The "Diff" column is not disambiguated. Is this a difference between currently measured INT8 and FP32 accuracy metrics, or is it a difference between measured INT8 accuracy and expected INT8 accuracy, or is it a difference between measured FP32 accuracy and expected FP32 accuracy? Our "Diff FP32" and "Diff Expected" are better on this account and have self-explanatory names.
- Your "Diff Expected" column shows the static "<1%" value, which is uninformative and probably should not warrant a column of its own at all if it is going to be one and the same value for any measurement result. Instead, follow the logic that our "Diff Expected" uses - it shows the difference of the currently measured accuracy (be it for the INT8 or for an FP32 model) to the current reference value for the same model.
- "nan" values should probably not be shown to the user. If there is no metric to be shown, use a "-" dash symbol instead.
- The coloring logic is not consistent with what we have in other similar tables. Please review `tests.tensorflow.test_sota_checkpoints.TestSotaCheckpoints.threshold_check` for the definition of the current logic (a sketch of that kind of check follows this list).
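For illustration only, a sketch of the kind of threshold check being referred to; the function name mirrors the one referenced above, but the thresholds and return values are assumptions rather than the actual `tests.tensorflow` implementation.

```python
from typing import Optional

# Hypothetical pass/warn/fail decision mirroring the coloring rules discussed above;
# the threshold values are assumed, not taken from tests.tensorflow.test_sota_checkpoints.
DIFF_TARGET_MAX = 0.1  # allowed deviation from the stored reference metric
DIFF_FP32_MAX = 1.0    # allowed INT8-vs-FP32 accuracy drop

def threshold_check(diff_target: float, diff_fp32: Optional[float] = None) -> str:
    """Return the cell color: green = pass, yellow = off the reference, red = failure."""
    if diff_fp32 is not None and diff_fp32 < -DIFF_FP32_MAX:
        return "red"     # quantization error or accuracy drop beyond the threshold
    if abs(diff_target) > DIFF_TARGET_MAX:
        return "yellow"  # metric deviates from the reference value
    return "green"
```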
As for the `yattag` vs `pandas` question - I've always thought that taking the existing code and trying to apply it in your scenario is easier than writing your own code. Our existing code, although far from perfect, is already producing tables that have all the info we need. Refactoring all of the framework E2E tests to use the same code for table building is outside of the scope of this PR, of course, and we won't ask it of you, but surely you would agree that it is easier for us to do the code unification if the code is actually similar everywhere we want to unify it.
I could probably let it slide for now, and accept your PR with pandas as it is, but only if you fix the mismatches with the format we've got in the rest of our E2E tests that I've outlined in the list above.
@vshampor
Claims 1, 5, 6, and 7 are fixed.
For Claim 2, I didn't know whether the NNCF ONNX PTQ pipeline is expected to require users to execute inference on the OpenVINO Inference Engine (OVIE). I thought that the ONNX PTQ pipeline only requires users to handle their model with the ONNXRuntime Execution Provider (OVEP). Therefore, I designed a test which takes an ONNX model as input, produces the quantized ONNX model, and compares the model accuracy between the two. These procedures use OVEP only, not OVIE. If this test is not required, we can close this PR.
For Claims 3 and 4, I updated the table schema to make it as similar to the TF and PT ones as I can. The columns are now as follows:
- `Expected FP32`: The desired model accuracy. This is only known by the original model author or trainer, and it is normally reported at https://github.com/onnx/models.
- `FP32`: The model accuracy from our OVEP FP32 inference result with accuracy checker.
- `INT8`: The model accuracy from our OVEP INT8 inference result for the quantized model with accuracy checker.
- `Diff FP32`: `INT8` - `FP32`
- `Diff Expected`: `INT8` - `Expected FP32` (see the sketch after this list)
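A minimal sketch of computing those two diff columns with pandas; the column names match the list above, and the example values are made up.

```python
import pandas as pd

# Example measurements; the numbers are made up for illustration.
df = pd.DataFrame(
    {
        "Model": ["resnet50", "ssd_mobilenet_v1"],
        "Expected FP32": [76.1, 23.0],
        "FP32": [76.1, 22.9],
        "INT8": [75.9, 22.5],
    }
)

df["Diff FP32"] = df["INT8"] - df["FP32"]               # measured INT8 vs. measured FP32
df["Diff Expected"] = df["INT8"] - df["Expected FP32"]  # measured INT8 vs. reference accuracy
print(df.round(2))
```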
But I give `Expected FP32` the same value as `FP32` for now. This is because it is very hard to reproduce the model accuracies reported at https://github.com/onnx/models, especially for detection models. Input pre-processing rules, post-processing rules, and metric computation implementation details can all affect the model accuracy, but that information is not reliable at https://github.com/onnx/models. So far, I've tried to make the `FP32` values as close to the reported values as possible, and focused on comparing `FP32` to `INT8`. Anyway, a failure of PTQ makes a big difference between `INT8` and `FP32`, so I focused on making a tool that verifies PTQ failures.
In addition, I designed the test to use pre-computed `FP32` values written in the JSON file (you can re-compute them anytime with the `e2e_eval_original_model` pytest marker, see https://github.com/openvinotoolkit/nncf/tree/develop/tests/onnx#test-for-onnx-framework-guide), rather than computing them every time. This test already requires ~18 hours using ~30 CPU cores, and we could expect an additional 36 hours to compute `FP32`. But `FP32` will never change until we change the OVEP dependency version (https://github.com/openvinotoolkit/nncf/blob/26ea9fc75705a1ee705e9ef10684846cc8d9bd87/setup.py#L133), so I thought it would be a huge waste to compute `FP32` every time.
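A hedged sketch of how such a cache of pre-computed FP32 metrics could be read and refreshed; the file name, JSON layout, and marker usage here are assumptions for illustration, not the actual test code.

```python
import json
from pathlib import Path

import pytest

# Hypothetical cache file and key names; the real ONNX reference JSON may be laid out differently.
REFERENCE_JSON = Path(__file__).parent / "onnx_models_eval_reference.json"

def load_cached_fp32(model_name: str) -> float:
    """Return the pre-computed FP32 metric for a model from the JSON cache."""
    with REFERENCE_JSON.open(encoding="utf-8") as f:
        references = json.load(f)
    return references[model_name]["fp32_metric"]

@pytest.mark.e2e_eval_original_model
def test_recompute_fp32_reference():
    # Intended to be selected explicitly, e.g. `pytest -m e2e_eval_original_model`;
    # it would re-evaluate the original FP32 models and rewrite the cached values (omitted here).
    ...
```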
To see the updated report after these fixes, please see the Build 63 artifacts.

@vinnamkim thanks, this looks better. A couple more fixes and I'll merge this.
- INT8 accuracy is almost everywhere better than FP32. It is also different from your previous run. This can't be right, and if it is right, we need an explanation for how that is possible.
- The legend that describes the meaning of the colors is missing. You can copy-and-paste the legend that we have in other E2E tests for this.
- If the code does not recalculate FP32 values in a given test run, then please make it so that the corresponding values in the "FP32" column are given in parentheses (i.e. `(52.02)`) so that it is immediately obvious from the table which values have been taken from the "cache". We do this in other E2E tests as well. Also please add an entry in the legend that describes what these parenthesized values mean (i.e., as I've described above, that these values have not been computed during this run, but taken from the cache instead). You might also want to set up a check for the current onnxruntime-openvino version so that the report will also say whether these values are outdated and should be regenerated.
- Since these are all OVEP results, please set up the table in such a manner that above the accuracy columns there is a merged cell saying `OVEP`, in the same manner as it is done for TF with their `tfrecords` designation (a sketch of one way to do this follows below).
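One possible way to get such a merged `OVEP` header cell when the report is built with pandas is a column `MultiIndex`; a hedged sketch follows, with made-up values and a cached FP32 value shown in parentheses as requested above.

```python
import pandas as pd

# The top header level groups the accuracy columns under "OVEP"; pandas renders it
# as a single merged cell spanning those columns in the HTML output.
columns = pd.MultiIndex.from_tuples(
    [
        ("", "Model"),
        ("", "Metrics type"),
        ("OVEP", "Expected"),
        ("OVEP", "FP32"),
        ("OVEP", "INT8"),
        ("OVEP", "Diff FP32"),
        ("OVEP", "Diff Expected"),
    ]
)

rows = [
    # The FP32 value is parenthesized to mark that it was taken from the cache, not recomputed.
    ["resnet50", "accuracy@top1", 76.1, "(76.1)", 75.9, -0.2, -0.2],
]
df = pd.DataFrame(rows, columns=columns)
html_report = df.to_html()  # "OVEP" appears as one merged cell over the grouped columns
```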
Fixed Claims 2, 3, and 4. As for Claim 1, the difference is because Build 63 was a short test run on subsets of the evaluation datasets.
(Screenshots comparing the evaluation dataset size ("Eval size") between Build 36 and Build 63 are not available.)
I re-launched Build 65 with full evaluation datasets to match the numbers with the previous one.
run pytorch pre-commit tests