
Result analysis misses generated test cases

Open gilles-peskine-arm opened this issue 1 year ago • 8 comments

Result analysis (analyze_outcomes.py) runs without the generated files present. This causes it to miss the test cases in automatically generated .data files.

Consequences:

  • We may be missing some genuine failures, e.g. driver testing parity failing on automatically generated test cases.
  • Failures can come up when we make a release and run the CI for the first time with all the generated .data files checked in.

Note that this may best be solved in the Groovy code in https://github.com/Mbed-TLS/mbedtls-test rather than the python script itself.

gilles-peskine-arm avatar Oct 04 '23 10:10 gilles-peskine-arm

I'm not entirely sure why the script is looking at the files present in order to get a list of test suites, rather than getting it from the outcomes file. Is there an obvious reason I'm missing?

mpg avatar Oct 04 '23 10:10 mpg

Note that this may best be solved in the Groovy code in https://github.com/Mbed-TLS/mbedtls-test rather than the python script itself.

Why wouldn't the script itself call make generated_files? It's not at all uncommon for developers to run this script locally, so I'd rather avoid having it behave differently in the CI / having gotchas when used locally, if possible.
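As a rough sketch, "the script calls make generated_files itself" could look something like this; where exactly this would hook into analyze_outcomes.py is an assumption, not the current behaviour:

```python
# Hypothetical sketch: ensure generated .data files exist before analysis.
# Assumes we are running from the top of the mbedtls source tree.
import subprocess

def ensure_generated_files(source_dir='.'):
    """Run 'make generated_files' so generated test suites are visible to the analysis."""
    subprocess.run(['make', 'generated_files'], cwd=source_dir, check=True)
```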

mpg avatar Oct 04 '23 10:10 mpg

why the script is looking at the files present in order to get a list of test suites, rather than getting it from the outcomes file

One of the things the script does is find test cases that were never executed. This needs knowledge of what's not in the outcome file.

Even for things that are in the outcome file, the script needs to know which ones it's supposed to consider: unit tests vs SSL tests vs whatever else might be recorded in the outcome file.
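In other words, conceptually something like this (names and the simplified row format are illustrative, not the script's actual structure):

```python
# Illustrative only: finding never-executed test cases needs a list of
# available test cases obtained independently of the outcome file.
def find_never_executed(available_test_cases, outcome_rows):
    """available_test_cases: set of test case descriptions found on disk
    (in .data files, ssl-opt.sh, etc.).
    outcome_rows: simplified (suite, case, result) tuples; the real outcome
    file has more fields."""
    executed = {case for _suite, case, _result in outcome_rows}
    return available_test_cases - executed
```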

gilles-peskine-arm avatar Oct 04 '23 11:10 gilles-peskine-arm

One of the things the script does is find test cases that were never executed. This needs knowledge of what's not in the outcome file.

Ok, in my mind it was about finding test cases that are always skipped in all configurations we run on the CI - which is almost but not quite the same as "never executed". The only difference I can think of is if we never run some test suite at all, and I might be overly optimistic, but that's not something that worries me quite as much as tests that are always skipped.

Even for things that are in the outcome file, the script needs to know which ones it's supposed to consider: unit tests vs SSL tests vs whatever else might be recorded in the outcome file.

  1. It seems to me that the script should consider everything anyway. We want to know about always-skipped / never-executed test cases wherever they are.
  2. If the script needs to distinguish between unit tests vs the rest, that seems easy enough to do based on names (does it start with test_suite_?)

I wonder if the script could adopt a mixed approach:

  • construct the list of expected test suites from the filesystem,
  • also construct it from the outcomes file,
  • compare and complain loudly about any difference (in either direction).

This would have caught the present issue, and would also catch it if we ever fail to run some test suite at all (rough sketch below). Wdyt?
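Roughly, I'm thinking of something along these lines (helper and variable names are only illustrative, not the script's actual API):

```python
# Rough sketch of the proposed cross-check between the filesystem and the
# outcome file; names are illustrative only.
def check_suite_lists_match(suites_on_disk, suites_in_outcomes):
    """Complain loudly about test suites present on one side but not the other."""
    missing_from_outcomes = suites_on_disk - suites_in_outcomes
    missing_from_disk = suites_in_outcomes - suites_on_disk
    for suite in sorted(missing_from_outcomes):
        print(f'ERROR: test suite {suite} was never run (absent from the outcome file)')
    for suite in sorted(missing_from_disk):
        print(f'ERROR: test suite {suite} is in the outcome file but not on disk')
    return not (missing_from_outcomes or missing_from_disk)
```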

mpg avatar Oct 06 '23 08:10 mpg

if we never run some test suite at all

This may have happened in the past when test suites were enumerated explicitly in both tests/Makefile and tests/CMakeLists.txt. At least, we had test cases that were never executed because they were missing from one of these and only enabled in a configuration that we tested with one of the build systems. I'm not sure if we've ever had test suites that were never run at all.

But anyway, the historical explanation for why the script enumerates available test cases is that the original goal was defined as “find test cases that are not executed”, not “find test cases that are always skipped”. I didn't set out to check whether all test suites are executed, but it seems a natural part of the goal. Enumerating always-skipped test cases seems like a roundabout way of achieving a weaker goal and I don't think I even considered it.

It seems to me that the script should consider everything anyway.

It can't: we may add other things that the script doesn't care about. For example, I'd like to record some information about builds in the outcome file, to help with failure analysis — for example, if a lot of components fail, trying to figure out what they have in common. analyze_outcomes wouldn't know about those rows and should just ignore them.

compare and complain loudly about any difference (in either direction).

Yes, we can do that, with some suitable filtering (as you write, only look at test_suite_.* and ssl-opt.* test suite names). Filed as https://github.com/Mbed-TLS/mbedtls/issues/8318
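For illustration, the filtering could be as simple as the following; the exact pattern used in the eventual fix may differ:

```python
import re

# Sketch of the name-based filtering mentioned above: keep only unit test
# suites and ssl-opt test suites.
RELEVANT_SUITE_RE = re.compile(r'^(test_suite_|ssl-opt)')

def relevant_suites(suite_names):
    return {name for name in suite_names if RELEVANT_SUITE_RE.match(name)}
```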

gilles-peskine-arm avatar Oct 06 '23 10:10 gilles-peskine-arm

Thanks for the explanations!

For example, I'd like to record some information about builds in the outcome file, to help with failure analysis — for example, if a lot of components fail, trying to figure out what they have in common

Wasn't aware of those plans, sounds very promising!

mpg avatar Oct 06 '23 10:10 mpg

I wonder if the script could adopt a mixed approach:

  • construct the list of expected test suites from the filesystem,
  • also construct it from the outcomes file,
  • compare and complain loudly about any difference (in either direction).

Case in point: I'm investigating issues with compat.sh in 2.28 and I'm seeing issues both ways:

  1. 1144 of the tests listed by compat.sh --list-test-cases are not executed - this is reported by analyze_outcomes.py as expected;
  2. but also, 1108 of the tests that actually run and pass on the CI (as seen in outcomes.csv) are not listed by compat.sh --list-test-cases - this is currently not seen at all by analyze_outcomes.py and I only found out about it while investigating the first issue.

The 2nd issue makes me quite uncomfortable as well, especially the fact that I only noticed it by chance, so I think analyze_outcomes.py should also catch that in the future.
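Something like this rough sketch could make that two-way check explicit; the outcome file layout and the 'compat' suite name here are assumptions on my part, not the real CI format:

```python
import subprocess

# Rough sketch: compare compat.sh --list-test-cases against what the outcome
# file records. Assumes we run from the top of the mbedtls source tree.
def compat_list_vs_outcomes(outcomes_csv='outcomes.csv'):
    listed = set(subprocess.run(
        ['tests/compat.sh', '--list-test-cases'],
        capture_output=True, text=True, check=True).stdout.splitlines())
    executed = set()
    with open(outcomes_csv) as f:
        for line in f:
            fields = line.rstrip('\n').split(';')
            # Assumed layout: platform;configuration;suite;case;result;...
            if len(fields) >= 5 and fields[2] == 'compat':
                executed.add(fields[3])
    print(len(listed - executed), 'listed by compat.sh but never executed')
    print(len(executed - listed), 'executed on the CI but not listed by compat.sh')
```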

mpg avatar Apr 09 '24 07:04 mpg

The commit message in https://github.com/Mbed-TLS/mbedtls-test/commit/ccf4faadb4261cbcfd3c988592ebfaa60e480b2e is incorrect. https://github.com/Mbed-TLS/mbedtls-test/pull/166 does most of the work of fixing this issue, but a little intervention in the mbedtls repository is needed as well.

gilles-peskine-arm avatar May 21 '24 18:05 gilles-peskine-arm