analyzer icon indicating copy to clipboard operation
analyzer copied to clipboard

Add option to compile and run regression tests to sanity check assertions

Open bottbenj opened this issue 3 years ago • 1 comments

Adds the new options -e and -eq to the regression tests that compiles and run the regression tests and then use the actual results of executing the assertions to check the annotations.

This of course has the issue that actually running the assertions generally doesn't result in an unknown result and some annotations are just never actually reached which results in a bunch of useless errors in -e mode which are suppressed in -eq which only shows issues that indicate definite errors.

This requires that the tests #include <assert.h> instead of relying on a global assert as done in #645.

With that only one assertion malloc_array:10 actually falls because it asserts that an uninitialized value is 4, there is a comment explaining it, so that is probably fine.

Beyond that there are 285 tests that either fail to compile, segfault when running or otherwise don't exit with code 0.

I am not sure how useful this is in general, but i think there is merit to using it at least when writing new tests.

bottbenj avatar May 01 '22 21:05 bottbenj

Compiling

Compiling the regression tests is definitely a sanity check we should have. This was also witnessed by having to add a missing #include <assert.h> to 160 tests using assert in #645.

A number of complications come to mind regarding that though:

  1. Some tests are Linux kernel modules (--enable kernel), which require a great bunch of include directories and some defines. For preprocessing, Goblint sets them up here: https://github.com/goblint/analyzer/blob/db3c996cef0abc45be7032202cd8557d7d5e87d2/src/maingoblint.ml#L244-L280.
  2. Some tests are OSEK code (and we still have ARINC analyses, but AFAIK no tests for them). I have no idea how, if at all, those could be compiled.
  3. On MacOS, we wouldn't want to use gcc (which actually links to Apple's Clang compiler), but some actual GCC compiler from homebrew (e.g. gcc-11). For preprocessing Goblint builds in the logic for finding a suitable version of cpp (not gcc because Goblint only needs to preprocess, not compile). I guess something similar would be necessary for finding the right GCC compiler on MacOS.

Eventually, if the compilation errors are fixed, we should have a CI workflow for doing those checks automatically.

Executing

I am extremely uneasy with actually executing any of the test programs though in such automatic manner. Especially due to stuff like the following: https://github.com/goblint/analyzer/blob/db3c996cef0abc45be7032202cd8557d7d5e87d2/tests/regression/01-cpa/24-library_functions.c#L8 Although most should be harmless, I still personally would never dare to, because besides such directly commented case, we have numerous minimized examples from actual projects, which may make some system calls.

Of course there's no chance of executing Linux and OSEK tests. (Technically the Linux modules could be loaded, but no way in hell I am loading those buggy kernel modules into my running system!)

As you correctly point out, actually testing the assertions via execution isn't really possible, so I'm not sure if we should even bother at the time being. In places a nonsensical regression testing style is used with contradictory conditions, e.g. https://github.com/goblint/analyzer/blob/db3c996cef0abc45be7032202cd8557d7d5e87d2/tests/regression/22-partitioned_arrays/01-simple_array.c#L69-L71 I tried to address this with #278 and move our regression test asserts to closer match the actual runtime semantics of assert, which, among other things, would also really help here.

The other thing is the mixture of many asserts with various expected outcomes in individual tests. For reasonable semantics outside of Goblint's regression testing, one would have to extract a number of combinations of the tests such that they don't contain multiple possibly-failing assertions (otherwise not all assertions would be checked for all results). In #378 a script was created to construct such regression test variations for SV-COMP where standard semantics are assumed. And even then, direct execution wouldn't really be the right way to check whether the assertions and their results match execution, since, as you mention, some UNKNOWN behavior cannot arise from a GCC-compiled binary (e.g. when Goblint soundly analyzes undefined behavior, which doesn't result in random coin flips during runtime). A more appropriate technique might be fuzzing (e.g. AFL), which more intelligently tries to trigger certain executions that would lead to failures.

Summary

Therefore, I would really want a separate option for running the compilation checks, but not executing anything. Possibly even stripping out all the execution-related changes to reduce the maintenance burden of a not-too-useful feature on the already-messy script.

sim642 avatar May 02 '22 02:05 sim642

Thank you again! As we aim to move away from our ruby-based testscript, we are not merging this at the moment.

michael-schwarz avatar May 31 '23 12:05 michael-schwarz