
Add accuracy acceptance criteria to onnx e2e test


Changes

  • Generate HTML report for ONNX e2e test
  • Add model accuracy degradation acceptance criteria for ONNX e2e test

Reason for changes

To keep alignment with other e2e tests.

Related tickets

88625

Tests

vinnamkim avatar Aug 03 '22 07:08 vinnamkim

retest this please

vinnamkim avatar Aug 04 '22 09:08 vinnamkim

Please refer to this build NNCF/job/nightly/job/e2e_onnx_model_zoo/22/ about this change.

vinnamkim avatar Aug 08 '22 01:08 vinnamkim

Please refer to this build NNCF/job/nightly/job/e2e_onnx_model_zoo/22/ about this change.

Thanks for the link. Could you make the report in the same style as all e2e builds?

alexsu52 avatar Aug 08 '22 05:08 alexsu52

Please refer to this build NNCF/job/nightly/job/e2e_onnx_model_zoo/22/ about this change.

Thanks for the link. Could you make the report in the same style as all e2e builds?

Please clarify your request. Which do you mean: the table schema, cell color formatting, or text font?

vinnamkim avatar Aug 08 '22 08:08 vinnamkim

Which do you mean: the table schema, cell color formatting, or text font?

I mean something like this: https://github.com/openvinotoolkit/nncf/blob/cf8784f3811c40c3aa99ba63593638589472542b/tests/tensorflow/test_sota_checkpoints.py#L200

alexsu52 avatar Aug 09 '22 13:08 alexsu52

@vinnamkim , do you plan to finalize this PR?

MaximProshin avatar Aug 17 '22 11:08 MaximProshin

@vinnamkim , do you plan to finalize this PR?

@alexsu52 @MaximProshin

Please clarify your requirements for merging this PR. I know there are existing reports for PyTorch and TensorFlow, e.g. nncf/tests/tensorflow/test_sota_checkpoints.py, but their table schema is fundamentally different from the ONNX report's. What exactly do you want me to make consistent with them? The table formatting (font style, color, etc.)? Alternatively, you could merge this PR as-is and make improvements in the future.

vinnamkim avatar Aug 18 '22 02:08 vinnamkim

@vinnamkim , do you plan to finalize this PR?

@alexsu52 @MaximProshin

Please clarify your requirements for merging this PR. I know there are existing reports for PyTorch and TensorFlow, e.g. nncf/tests/tensorflow/test_sota_checkpoints.py, but their table schema is fundamentally different from the ONNX report's. What exactly do you want me to make consistent with them? The table formatting (font style, color, etc.)? Alternatively, you could merge this PR as-is and make improvements in the future.

Hi @vinnamkim!

We would like to have the same table style as torch or tf: if the model passes (the accuracies are as expected), the row is green; if not, it is yellow; and it is red if an error occurred during quantization or the difference between INT8 and FP32 exceeds the threshold. Such a table is easy to understand. If you need any help, please contact me or @vshampor directly. Thanks!

| Model | Metrics type | Expected FP32 | FP32 | INT8 | Diff | Diff Expected |
| -- | -- | -- | -- | -- | -- | -- |
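As a rough illustration of the coloring rules just described, here is a minimal Python sketch; the threshold value, function name, and status strings are assumptions, not NNCF's actual implementation:

```python
# Hypothetical helper mirroring the rules above; the 1% threshold is assumed.
ACCURACY_DROP_THRESHOLD = 1.0  # percentage points


def classify_row(fp32_metric, int8_metric, expected_metric, error=None):
    """Return the color used to highlight a result row in the HTML report."""
    if error is not None or int8_metric is None:
        return "red"  # quantization failed
    if fp32_metric - int8_metric > ACCURACY_DROP_THRESHOLD:
        return "red"  # INT8 dropped too far below FP32
    if abs(int8_metric - expected_metric) <= ACCURACY_DROP_THRESHOLD:
        return "green"  # accuracy is as expected
    return "yellow"  # ran fine, but accuracy deviates from the expected value
```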

kshpv avatar Aug 18 '22 09:08 kshpv

@vinnamkim I tend to agree with the rest of the team on the formatting and creation of the accuracy results table. The ONNX version should be consistent with the PT/TF versions, both in the configuration file format (i.e. JSON would be preferable to CSV; follow the format in PT, for example) and in the process of building the resulting HTML. If you can rewrite the TF/PT HTML building process using pandas so that it is more concise and readable, please do so in this PR or in a separate one (in the latter case, we would have to test and merge the separate PR prior to this one, though).

Or, you can just merge this PR and make an improvement in the future.

Surely you would agree that following these kinds of arguments would not make for a sustainable, continuously improving state of the repository.
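As a rough illustration of the pandas-based approach suggested above, here is a minimal sketch; the column names, the 1% threshold, and the output path are illustrative assumptions rather than the actual NNCF code, and Styler.to_html needs a reasonably recent pandas (>= 1.3):

```python
import pandas as pd

# Illustrative results; real data would come from the accuracy-checker runs.
results = pd.DataFrame([
    {"Model": "resnet50", "Metrics type": "accuracy@top1",
     "Expected": 76.1, "FP32": 76.1, "INT8": 75.9,
     "Diff FP32": -0.2, "Diff Expected": -0.2},
])


def color_row(row):
    """Color a result row green/yellow/red depending on the accuracy diff."""
    if pd.isna(row["INT8"]):
        color = "red"
    elif abs(row["Diff Expected"]) <= 1.0:  # assumed 1% acceptance threshold
        color = "green"
    else:
        color = "yellow"
    return [f"background-color: {color}"] * len(row)


# Only the data cells are colored; header cells are left untouched.
html = results.style.apply(color_row, axis=1).to_html()
with open("results.html", "w", encoding="utf-8") as report:
    report.write(html)
```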

vshampor avatar Aug 18 '22 09:08 vshampor

Hi @kshpv, @vshampor, I updated the table schema and style to match what you described.

vinnamkim avatar Aug 24 '22 01:08 vinnamkim

For me it looks okay. @vshampor, please take a look

kshpv avatar Aug 24 '22 11:08 kshpv

@vshampor @vinnamkim could you please contact each other and finalize this PR?

kshpv avatar Sep 08 '22 09:09 kshpv

@vinnamkim it's pretty hard to review this if you won't at least post a screenshot of what the table actually looks like when rendered.

You also haven't addressed my other concerns:

  1. See tests.torch.test_sota_checkpoints.TestSotaCheckpoints.write_results_table for the actual code that does HTML table rendering for the PT accuracy test, and tests.tensorflow.test_sota_checkpoints.RunTest.write_results_table for the TF one, respectively. The way the table is built there is different from what you are proposing. Please reuse yattag's Doc class for markup-as-code in your solution, and PrettyTable for building and rendering the table. If you think that using a pandas dataframe is easier and more beautiful than using a PrettyTable, then please modify the PT and TF table-building code so that they also use pandas. Reuse code and extract the common (across frameworks) parts of the HTML rendering code to tests/common where possible.
  2. What purpose do the accuracy_checker-reference.csv files serve? If these are for storing accuracy reference values, then this is not the way we do this. We instead store only the values that are actually necessary for us in a structured manner in tests/tensorflow/sota_checkpoints_eval.json and tests/torch/sota_checkpoints_eval.json. Make yourself acquainted with the structure of these files and set up the ONNX-specific JSON file with the same structure for the models in your PR.
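As a hedged illustration of point 2, reference metrics could be stored per model in a structured JSON file and turned into an acceptance check roughly as below; the schema, names, and threshold are assumptions and do not reproduce the actual layout of sota_checkpoints_eval.json:

```python
import json

# Hypothetical schema: one entry per model with its metric type,
# reference FP32 accuracy, and the allowed INT8 accuracy drop.
REFERENCES = json.loads("""
{
  "resnet50": {"metric_type": "accuracy@top1", "target_fp32": 76.1, "max_drop": 1.0}
}
""")


def check_acceptance(model_name, fp32_metric, int8_metric):
    """Fail the e2e test if INT8 accuracy degrades more than the per-model allowance."""
    reference = REFERENCES[model_name]
    drop = fp32_metric - int8_metric
    assert drop <= reference["max_drop"], (
        f"{model_name}: INT8 accuracy drop {drop:.2f} exceeds "
        f"the allowed {reference['max_drop']:.2f}"
    )


check_acceptance("resnet50", fp32_metric=76.1, int8_metric=75.9)
```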

Firstly, you can see ONNX E2E model zoo build 36. Secondly, let me share my concerns about your claims:

  1. I'm not familiar with libraries like prettytable and yattag, and I don't have the resources to rework this PR to use them. In addition, it's hard to agree that this PR should use those libraries, since PT and TF each have their own implementation and there is no common interface code yet. If we proceed to build a common interface for generating reports across all frameworks, I don't think this PR is the right place for that work, because I believe a PR should be as small as possible.
  2. I agree.

Thus, my suggestion is to fix Claim 2 and conclude this PR. What do you think?

vinnamkim avatar Sep 14 '22 04:09 vinnamkim

@vinnamkim let's have a quick comparison with the expected format for the rest of the E2E tests (using the TF example; ignore the "tfds" section for simplicity): [screenshot of the TF E2E results table]

and yours: [screenshot of the proposed ONNX report table]
  1. Only the cells with actual numbers are supposed to be colored. Yours are colored throughout, including the header cells.
  2. The E2E result tables have separate sections specifying the results obtained while inferring with the original framework ("tfrecords" in our TF E2E table) and while inferring via OpenVINO ("OpenVino" in our TF E2E table). Can you please tell whether the results in your table are obtained using onnxruntime inference or OpenVINO Inference Engine inference? If the former, then this actually shouldn't be called an E2E test, since an E2E test involves running the model in OpenVINO and assessing its accuracy in the end. If the latter, then at least put the current results into the "OpenVINO" section as it is done in the TF E2E table.
  3. The "Expected FP32" column shouldn't really be filled entirely with "None", don't you agree? There would be no reason for the entire column otherwise. For the TF and PT E2E we have an "Expected" column, which you should keep instead, and which reflects the expected, reference accuracy metric for the associated model against which the measured accuracy is compared with respect to the threshold.
  4. The "Diff" column is not disambiguated. Is this the difference between the currently measured INT8 and FP32 accuracy metrics, or between the measured INT8 accuracy and the expected INT8 accuracy, or between the measured FP32 accuracy and the expected FP32 accuracy? Our "Diff FP32" and "Diff Expected" are better in this respect and have self-explanatory names.
  5. Your "Diff Expected" column shows the static "<1%" value, which is uninformative and probably should not warrant a column of its own at all, if it is going to be one and the same value for any measurement result. Instead, follow the logic that our "Diff Expected" uses - it shows the difference of the currently measured accuracy (be it for the INT8 or for an FP32 model) to the current reference value for the same model.
  6. "nan" values should probably not be shown to the user. If there is no metric to be shown, use a "-" dash symbol instead.
  7. The coloring logic is not consistent with what we have in other similar tables. Please review tests.tensorflow.test_sota_checkpoints.TestSotaCheckpoints.threshold_check for the definition of current logic.

As for the yattag vs pandas - I've always thought that taking the existing code and trying to apply it in your scenario is easier than writing your own code. Our existing code, although far from perfect, is already producing tables that have all the info we need. Refactoring all of the framework E2E tests to use the same code for table building is outside of the scope of this PR, of course, and we won't ask it of you, but surely you would agree that it is easier for us to do the code unification if the code is actually similar everywhere we want to unify it.

I could probably let it slide for now, and accept your PR with pandas as it is, but only if you fix the mismatches with the format we've got in the rest of our E2E tests that I've outlined in the list above.

vshampor avatar Sep 14 '22 12:09 vshampor

@vshampor

Claims 1, 5, 6, and 7 are fixed.

For Claim 2, I wasn't sure whether the NNCF ONNX PTQ pipeline expects users to execute inference on the OpenVINO Inference Engine (OVIE). I thought the ONNX PTQ pipeline only requires users to handle their model with ONNX Runtime's OpenVINO Execution Provider (OVEP). Therefore, I designed a test that takes an ONNX model as input, produces the quantized ONNX model, and compares the accuracy between the two. These procedures use OVEP only, not OVIE. If this test is not required, we can close this PR.

For Claims 3 and 4, I updated the table schema to make it as similar to TF and PT as I could. Those columns are as follows.

  1. Expected FP32: The desired model accuracy. It is known only to the original model author or trainer and is normally reported in https://github.com/onnx/models.
  2. FP32: The model accuracy obtained from our OVEP FP32 inference result with Accuracy Checker.
  3. INT8: The model accuracy obtained from our OVEP INT8 inference result for the quantized model with Accuracy Checker.
  4. Diff FP32: INT8 - FP32
  5. Diff Expected: INT8 - Expected FP32

However, I set Expected FP32 to the same value as FP32 for now. This is because it is very hard to reproduce the model accuracy reported in https://github.com/onnx/models, especially for detection models. Input pre-processing rules, post-processing rules, and metric computation implementation details can all affect the model accuracy, and that information is not reliably documented in https://github.com/onnx/models. So far, I have tried to make the FP32 values as close to the reported values as possible and focused on comparing FP32 to INT8. In any case, a PTQ failure produces a big difference between INT8 and FP32, so I focused on making a tool that detects PTQ failures.

In addition, I designed the test to use pre-computed FP32 values stored in the JSON file (you can re-compute them anytime with the e2e_eval_original_model pytest marker, https://github.com/openvinotoolkit/nncf/tree/develop/tests/onnx#test-for-onnx-framework-guide), rather than computing them every time. This test already requires ~18 hours using ~30 CPU cores, so we can expect an additional 36 hours to compute FP32. However, the FP32 values will never change until we change the OVEP dependency version: https://github.com/openvinotoolkit/nncf/blob/26ea9fc75705a1ee705e9ef10684846cc8d9bd87/setup.py#L133 So I thought it would be a huge waste to compute FP32 every time.
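A minimal sketch of this caching idea, under an assumed file path and JSON layout rather than the actual test code:

```python
import json
import pathlib

# Hypothetical cache of FP32 metrics, refreshed only on demand
# (e.g. when the e2e_eval_original_model pytest marker is selected).
FP32_CACHE = pathlib.Path("tests/onnx/data/fp32_reference.json")


def get_fp32_metric(model_name, evaluate_fn, recompute=False):
    """Return (metric, from_cache); recompute=True refreshes the stored value."""
    cache = json.loads(FP32_CACHE.read_text()) if FP32_CACHE.exists() else {}
    if not recompute and model_name in cache:
        return cache[model_name], True
    metric = evaluate_fn(model_name)  # the expensive, hours-long evaluation
    cache[model_name] = metric
    FP32_CACHE.write_text(json.dumps(cache, indent=2))
    return metric, False
```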

To see the changed report after fixing, please see build 63 artifacts.

vinnamkim avatar Sep 19 '22 06:09 vinnamkim

[screenshot of the updated ONNX report table]

@vinnamkim thanks, this looks better. A couple more fixes and I'll merge this.

  1. INT8 accuracy is almost everywhere better than FP32. It is also different from your previous run. This can't be right, and if it is right, we need an explanation for how that is possible.
  2. The legend that describes the meaning of colors is missing. You can copy-and-paste the legend that we have in other E2E tests for this.
  3. If the code does not recalculate FP32 values in a given test run, then please make it so that the corresponding values in the "FP32" column are given in parentheses (i.e. (52.02)), so that it is immediately obvious from the table which values have been taken from the "cache". We do this in other E2E tests as well. Also please add an entry in the legend that describes what these parenthesized values mean (i.e., as described above, that these values have not been computed during this run but taken from the cache instead). You might also want to set up a check for the current onnxruntime-openvino version so that the report will also say whether these values are outdated and should be regenerated.
  4. Since this is all OVEP results, please set up the table in such a manner that above the accuracy columns there is a merged cell saying OVEP, in the same manner as it is done for TF with their tfrecords designation: [screenshot of the TF E2E table with the tfrecords header section]
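One possible way to get the merged "OVEP" header cell, the parenthesized cached values, and dashes for missing metrics with pandas; this is a sketch over made-up data, not the actual report code:

```python
import pandas as pd


def fmt(value, from_cache=False):
    """Format a metric cell: '-' for a missing value, parentheses for a cached one."""
    if value is None or pd.isna(value):
        return "-"
    return f"({value:.2f})" if from_cache else f"{value:.2f}"


# MultiIndex columns make DataFrame.to_html render the shared "OVEP" label
# as a merged header cell, similar to the "tfrecords" section in the TF E2E table.
columns = pd.MultiIndex.from_tuples([
    ("", "Model"), ("", "Metrics type"), ("", "Expected"),
    ("OVEP", "FP32"), ("OVEP", "INT8"),
    ("OVEP", "Diff FP32"), ("OVEP", "Diff Expected"),
])

# Example row: the FP32 value comes from the cache, so it is parenthesized;
# the missing Diff Expected metric is shown as a dash.
row = ["resnet50", "accuracy@top1", fmt(76.10),
       fmt(76.10, from_cache=True), fmt(75.90), fmt(-0.20), fmt(None)]

html = pd.DataFrame([row], columns=columns).to_html()
```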

vshampor avatar Sep 19 '22 09:09 vshampor

Fixed Claims 2, 3, and 4. For Claim 1, the discrepancy is because Build 63 was a short test run on subsets of the evaluation datasets.

|  | Build 36 | Build 63 |
| -- | -- | -- |
| Eval size | [screenshot] | [screenshot] |

I re-launched Build 65 with the full evaluation datasets so the numbers can be compared with the previous run.

vinnamkim avatar Sep 20 '22 01:09 vinnamkim

run pytorch pre-commit tests

vinnamkim avatar Sep 20 '22 15:09 vinnamkim