
"Rotten Green Test" detection

Open pogo59 opened this issue 2 years ago • 6 comments

Does the feature exist in the most recent commit?

No.

Why do we need this feature?

Tests might look like they're doing something useful when they aren't. Finding test assertions that aren't actually executed is not always easy. This feature helps.

Describe the proposal

Inspired by "Rotten Green Tests", Delplanque et al., ICSE 2019 DOI 10.1109/ICSE.2019.00062

A Rotten Green Test is a test assertion that looks like it is verifying something useful about code behavior, but in fact the assertion is never executed. Because the test didn't explicitly fail, it's assumed to have passed. The proposed feature instruments Google Tests to detect EXPECT_* and ASSERT_* calls that are not executed, indicating that they are Rotten Green Tests.

In order to avoid false positives, it does not report any test that was filtered out, skipped, disabled, or otherwise not run. It also doesn't report failed tests, because failures (especially with ASSERT* macros) may well skip other assertions, so reporting those as rotten isn't useful. (This is Rotten Green Test detection, after all.)

It achieves this by instrumenting the macros to statically record the source location, and dynamically mark the location as executed when the assertion passes. Tests with rotten assertions have a new result status, which lets them fit into the normal reporting mechanisms.

Test authors do nothing new. The feature is automatically enabled, although there are ways to turn it off: a compile-time toggle in gtest-port.h, or a command-line option when running the test. (The latter is handy for disabling the feature in a batch mode, but leaving it enabled when developers run an individual test manually.)

The output from a rotten test is modeled on other kinds of test reporting. Here's an example from the LLVM project, which I've used as a guinea pig for developing the feature (finding a variety of test issues, most of which have been fixed). The source location of the rotten assertion appears immediately, and there's a summary at the end.

...
[ RUN      ] OpenMPIRBuilderTest.SingleDirectiveNowait
/home/probinson/ssd/projects/llvm-org/llvm-project/llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp:3070: Rotten 
[  ROTTEN  ] OpenMPIRBuilderTest.SingleDirectiveNowait (0 ms)
...
[==========] 52 tests from 1 test suite ran. (9 ms total)
[  PASSED  ] 52 tests.
[  ROTTEN  ] 4 tests, listed below:
[  ROTTEN  ] OpenMPIRBuilderTest.ParallelSimple
[  ROTTEN  ] OpenMPIRBuilderTest.ParallelIfCond
[  ROTTEN  ] OpenMPIRBuilderTest.ParallelCancelBarrier
[  ROTTEN  ] OpenMPIRBuilderTest.SingleDirectiveNowait

  YOU HAVE 4 ROTTEN TESTS

Proving a test is rotten to a skeptical developer is easy: replace the rotten EXPECT or ASSERT with assert(false); and it won't trigger.

I would like to integrate this feature into upstream googletest. At the moment I have not upstreamed it to LLVM, although I've had discussions with some people who see how helpful it has been.

Here's a presentation I gave at the LLVM Developers Meeting last fall: https://www.youtube.com/watch?v=shBJwHiyirc It's based on my first prototype, before I had the idea of inventing a new test result, but the basic premise is still the same.

Is the feature specific to an operating system, compiler, or build system version?

No, although some aspects of the implementation are. My LLVM prototype runs successfully on Linux and Windows, using Clang, gcc, and MSVC compilers. The implementation has a hard dependency on static initialization, and being able to find all the statically initialized bits at runtime; this has required a different solution for each of these environments.

It doesn't work everywhere; GCC older than 8.0 has a bug that prevents it from working (causes a compile-time error) for EXPECT/ASSERT within a lambda. This was the main motivation for having a compile-time toggle to turn the feature off.

I've had no opportunity to try this on a Mac.

pogo59 avatar Apr 29 '22 21:04 pogo59

This sounds useful. We would consider accepting this change. However, before we start reviewing any pull requests, we would like to take a look at a comprehensive design proposal.

dinord avatar May 02 '22 19:05 dinord

Detailed description of Rotten Green Test detection in googletest.

Overview

Rotten Green Test detection consists of (a) statically identifying all test assertions; (b) dynamically recording which assertions are executed; (c) reporting un-executed ("rotten") assertions. Steps (a) and (b) are handled by adding instrumentation to the existing assertion macros (EXPECT_* and ASSERT_*); step (c) uses a new TestEventListener subclass to examine the assertions and report a new kRotten TestPartResult type as appropriate.

Identifying All Test Assertions

Instrumenting Assertion Macros

The vast majority of EXPECT_* and ASSERT_* macros ultimately invoke the GTEST_ASSERT_ macro. We modify the GTEST_ASSERT_ macro very simply, to do the work we need to identify and record the execution of each assertion.

The original macro is this:

#define GTEST_ASSERT_(expression, on_failure) \
  GTEST_AMBIGUOUS_ELSE_BLOCKER_ \
  if (const ::testing::AssertionResult gtest_ar = (expression)) \
    ; \
  else \
    on_failure(gtest_ar.failure_message())

Note that the macro does not end with a semicolon or wrap itself in braces; this permits users to stream additional data to the failure message. So, our instrumentation cannot interfere with these features. Instead, we instrument the success path of the assertion:

#define GTEST_ASSERT_(expression, on_failure) \
  GTEST_AMBIGUOUS_ELSE_BLOCKER_ \
  if (const ::testing::AssertionResult gtest_ar = (expression)) \
    { GTEST_RGT_DECLARE } \    <--- the new bit
  else \
    on_failure(gtest_ar.failure_message())

This means that we can only record the execution of passing test assertions; but that's okay, because this is Rotten Green Test detection.

We do similar instrumentation for the various GTEST_TEST_* macros that do not use GTEST_ASSERT_.

We also have to do something special for EXPECT_[NON_]FATAL_FAILURE macros. These wrap assertions that are expected to fail, and therefore will not have their executed flag set; but because the containing Test actually passes, these expected-fail assertions will incorrectly appear as rotten. To avoid reporting these false positives, we have the EXPECT_[NON_]FATAL_FAILURE macros set a flag in the containing TestInfo, and we later skip Rotten reporting for those Tests.

Test Assertion Data

The GTEST_RGT_DECLARE_ macro statically defines a struct describing where the assertion occurs, and a flag indicating whether the assertion was executed. The struct looks like this.

struct RgtStaticItem {
  ::testing::TestInfo *const *test_info; // Handle on the containing Test
  const char *file;                      // __FILE__
  int line;                              // __LINE__
  bool executed;                         // Whether it was executed
  constexpr RgtStaticItem(::testing::TestInfo *const *test_info,
                          const char *file, int line) :
      test_info(test_info), file(file), line(line), executed(false) {}
};

The constexpr constructor is critical; RgtStaticItem must be initialized statically. (Dynamic initialization occurs only if control flows over the declaration of the item, and the entire point of this feature is to detect when that doesn't happen.)

Associating Assertion Data With Tests

The RgtStaticItems conceptually form a single array. (Arranging this can be tricky depending on the build environment; details on this later, see "The Gory Bits" below.) But what we really want is a per-Test array of these items, so that each Test knows which assertions belong to it, and it can do appropriate reporting.

Before Tests begin running, each TestInfo gets an array of pointers to the RgtStaticItems for assertions contained within that Test. (We can't set that up statically, because the TestInfo instances are constructed dynamically.) To achieve this, we define a new subclass of TestEventListener named RgtListener. Its OnTestProgramStartup method runs through the array of all RgtStaticItems, and records a pointer to each RgtStaticItem in a vector in the associated TestInfo instance. The TestInfo instances have been created by the time OnTestProgramStartup runs.

Recording Executed Assertions

This is really the simplest thing ever: GTEST_RGT_DECLARE_ has a statement setting the executed field for the current assertion to true. Because of how the macro instrumentation works, this can only happen for a passing assertion, but again that's okay because we only care about green (passing) assertions.

Reporting Results

Identifying Rotten Assertions

The short and simple description is this:

RgtListener::OnTestEnd looks at the RgtStaticItems for the given Test, and for any that are not executed it invokes GTEST_MESSAGE_AT_ with the source coordinates of the assertion, and a result of kRotten.

There are complexities, of course. RgtListener::OnTestEnd does nothing for a Test that does not otherwise pass; this automatically suppresses rotten reporting for a Test that failed, was skipped, was disabled, or otherwise wasn't run. (A failing Test failed at least one assertion, which won't be marked as executed, and reporting failing assertions as rotten is just noise.) Similarly, if a Test is flagged as using EXPECT_[NON_]FATAL_FAILURE, it will skip rottenness reporting.

RgtListener::OnTestEnd also has to de-duplicate data. Duplicates are most likely to occur in a template that contains test assertions. Each instantiation has its own RgtStaticItem, but they all have the same source location. Different instantiations may easily take different paths, so not every instantiation's static item will show as executed. But if an assertion is executed by at least one instantiation, then it is not rotten, and we should not report it. Therefore, RgtListener::OnTestEnd constructs a temporary map keyed by source location, and logically "or"s together all the executed flags before reporting any rotten results to GTEST_MESSAGE_AT_.
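The de-duplication step can be sketched like this (the Item struct and rotten_locations function are illustrative names, not part of the proposal):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Item {
  const char *file;
  int line;
  bool executed;
};

// OR together the executed flags of all items sharing a source location,
// so an assertion inside a template counts as executed if any one
// instantiation ran it. Only locations never executed are reported.
std::vector<std::pair<std::string, int>>
rotten_locations(const std::vector<Item> &items) {
  std::map<std::pair<std::string, int>, bool> executed_at;
  for (const Item &i : items)
    executed_at[{i.file, i.line}] |= i.executed;

  std::vector<std::pair<std::string, int>> result;
  for (const auto &entry : executed_at)
    if (!entry.second) result.push_back(entry.first);
  return result;
}
```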

Reporting to the User

Test results are reported to the user by PrettyUnitTestResultPrinter and the Rotten case is no different. The appropriate simple changes to accommodate a new result type take care of that.

By default, any Rotten test result will cause the test program to fail. There is a command-line option --gtest_treat_rotten_as_pass to suppress that behavior; rotten tests will still be reported, but if there are no other failures then the overall test program will pass. This is useful in test suites that haven't been fully cleaned up, allowing them to run without failing (just like they did before the RGT feature was added) until the test suite owners have had a chance to correct the tests.

Assertions Not In Any Test

One necessary bit of information for the RgtStaticItem is which Test it is part of; specifically, it needs the TestInfo associated with the containing Test function. We achieve this, while still getting guaranteed compile-time initialization, by capturing the address of the static test_info_ member of the Test's class. (By the time startup code runs, the test_info_ member has been initialized, and we use it to find the TestInfo for the Test.)

But, assertions don't have to be (lexically) contained within a Test; they can be in a helper function outside of any Test. What to do with these "orphan" test assertions?

What we do is define a global variable with the same name as the test_info_ member. That way, if a test assertion is defined inside a Test, the name reference from GTEST_RGT_DECLARE_ resolves to the class member; if it is defined outside any Test, the name resolves to the global variable. The global is constant-initialized to nullptr so we can tell whether any given static item is an orphan.

Because it seems wildly inappropriate to have a global variable named test_info_ this data member has been renamed to gtest_test_info_. Note that it must be an unqualified name in the global namespace, otherwise the context-dependent name resolution isn't going to work. Suggestions for better names are more than welcome.

RgtListener::OnTestProgramStartup creates a vector of the orphan assertions on the side, equivalent to the vector in each TestInfo. This allows the orphan assertions to be checked and reported at some appropriate point.

I am uncertain what that "appropriate point" would be; presumably it would be another RgtListener method, but it's unclear to me which one would work best for that.

The Gory Bits

The "Array" of RgtStaticItem: Concept vs Reality

As mentioned previously, the RgtStaticItems conceptually form an array of data, one element per test assertion. In some environments, it actually is that kind of array, although we do some slightly tricky things in order to access it. In other environments, we need to do more drastic things to get the equivalent effect.

The current implementation works for Clang, gcc, and Windows MSVC. Each of these works a little differently.

Clang

This is the simplest case. Each RgtStaticItem is allocated to an object-file section named GTEST_RGT. The linker ensures that all contributions to that section end up adjacent in the executable image (although order isn't guaranteed). The linker automatically provides the symbols __start_GTEST_RGT and __stop_GTEST_RGT, which let code find the bounds of the array at runtime. This is a known feature of GNU linkers, and LLVM's lld.

GCC

The same trick almost works with gcc. The one case where it doesn't work is when a test assertion is contained in an inline function; in that situation, gcc insists on certain section attributes, and those are inherently incompatible with using the section anywhere else. There is a stackoverflow question specifically about this.

In order to address this, we apply the solution described here. In short, allocate the static item without any section attribution, then generate a helper function to record its address elsewhere, and use inline asm to create a .init_array entry to invoke the helper function. gcc doesn't complain about using .init_array in an inline function because that's a predefined well-known section.

This means RgtListener::OnTestProgramStartup needs to do a little extra indirection to find all the static data items, but it can still find them.

Note, gcc versions prior to 8.0 reportedly don't permit inline asm within a lambda. I can report that this is certainly true for gcc 7.5. See "Avoiding Troublesome Situations" below for how we keep users from running into this problem.

Windows MSVC

In this environment, it appears that allocating arbitrary data to a custom section has peculiar padding issues, and it was infeasible to access the elements as an array. Instead, we imitate the gcc solution, allocating the static elements without any special attributes, using a helper function, and capturing a pointer to that function into the special section. This makes the padding problem much more tractable, because the padding shows up simply as null pointers. We do have to call the helper functions manually, instead of having them run automatically as in the .init_array case, but functionally it works the same as the gcc case.

Also, we need to define our own symbols for the bounds of the array of function pointers, because Windows does not provide that on its own. Fortunately Windows has strict lexical section-ordering rules, so it's straightforward to do this (conceptually straightforward; it requires some C macro indirection to get it all right).

Avoiding Troublesome Situations

As noted above, RGT makes use of a feature that older gcc compilers do not support. To accommodate this automatically, the entire RGT feature is under a GTEST_HAS_RGT toggle, and this is automatically set in include/gtest/internal/gtest-port.h (and can be overridden in include/gtest/internal/custom/gtest-port.h).

One reason to override this setting would be that your test suite doesn't actually use lambdas, or at least doesn't put test assertions into them. In that case using an older gcc is perfectly fine, and in fact I have been using gcc 7.5 for most of my work.

Even if the test suite does use test assertions in lambdas, I've created yet another workaround. Sharp-eyed readers will have noticed that in the discussion of changes to GTEST_ASSERT_ the instrumentation macro is named GTEST_RGT_DECLARE but elsewhere it's named GTEST_RGT_DECLARE_ (with trailing underscore). This allows an only mildly annoying workaround to the lambda problem: Disable RGT for just the individual Test. This can be done as follows.

#undef GTEST_RGT_DECLARE
#define GTEST_RGT_DECLARE
... troublesome test ...
#undef GTEST_RGT_DECLARE
#define GTEST_RGT_DECLARE GTEST_RGT_DECLARE_

This is kind of an awkward thing, and I am not at all attached to it. I can easily imagine not wanting to include it in googletest! But I decided to describe it because I want this document to be thorough.

Testing

The prototype was implemented as a private change against the LLVM project. Clang and LLVM use googletest for implementing several dozen test programs, and I'm familiar with that code base, so it seemed like a good starting point. The work has resulted in a number of test fixes; fortunately none of the fixes then revealed underlying code flaws.

Probably the most common fix was replacing a simple "return" statement with GTEST_SKIP() when the test function made some runtime decision about whether to run. In at least one case this required minor refactoring to split a test into multiple functions that could each be entirely run or entirely skipped. There was also one case where SetUp() wasn't being called at all, and almost an entire test program was implicitly disabled without any indication. Re-enabling the test took significant effort, as the functionality under test had evolved without the test being updated.

For incorporation into upstream googletest, obviously the feature will need its own set of tests, which have not been written yet.

Data Size and Runtime Cost

Naturally it's worth looking at how much space and time it takes to support the RGT feature. The cost obviously will vary based on how extensive an individual test program is, both in terms of number of test assertions and the length of the file paths.

The static data cost consists of:

  • RgtStaticItems, one per assertion
  • For gcc or MSVC, one additional pointer per assertion
  • Full filename strings, one per source file (that contains an assertion)

On a 64-bit system, RgtStaticItem (two pointers, one int, one bool) is 24 bytes per element; plus 8 when using gcc or MSVC, because of the need to capture a helper function pointer. For a test program with ~8000 assertions (more than any Clang or LLVM unittest) this is around 256KB.

Dynamically, each static item will cost an additional pointer plus a bit of overhead, for the vectors attached to each TestInfo instance that point to "their" assertions. For our example 8K-assertion program that's another 64K give or take. There is also the temporary map used to deduplicate info during reporting, but these will be small (one map element per assertion at worst) and ephemeral (created and destroyed within the same function) so not worth worrying about.

Filename strings tend to run under 100 bytes, IME, unless you have some really deep nesting in your directory tree. But, there's only one copy per source file, so the net for even a large test program is unlikely to be more than a few KB.

The runtime cost is (a) generating the per-TestInfo vectors on startup, (b) deduplicating and iterating the info when reporting.

Here are some statistics from an LLVM unittest (ADTTests) with 10876 assertions across 1337 Test functions. This is the most assertions in any Clang/LLVM unittest by about a factor of 2. Compiled by gcc with and without RGT to show the difference.

                     gcc w/RGT    gcc no RGT   difference
.data                0x5440d      0xcf5        0x53718
.init_array          0x15668      0x288        0x153e0
runtime (avg of 5)   1.770 s      1.774 s      -0.004 s

Note the .init_array difference is exactly as expected: 8 x 10876. The runtime difference is imperceptible. The .data difference is more than the RgtStaticItem total would account for; it's not clear what's going on there.

So, RGT has a distinct cost in static data size and therefore executable size, but the runtime cost is next to nothing.

pogo59 avatar May 04 '22 00:05 pogo59

What we do is define a global variable with the same name as the test_info_ member. That way, if a test assertion is defined inside a Test, the name reference from GTEST_RGT_DECLARE_ resolves to the class member; if it is defined outside any Test, the name resolves to the global variable. The global is constant-initialized to nullptr so we can tell whether any given static item is an orphan.

It occurred to me last night that defining gtest_test_info_ in the anonymous namespace would avoid introducing a variable name into the global namespace. While defining/initializing a global variable inside an anonymous namespace in a header violates multiple coding standards, it turns out to be exactly what we need in this case:

  • the unqualified name resolves to the class member from inside a Test;
  • the unqualified name resolves to the (now per-compilation-unit) variable from outside a Test;
  • does not introduce any names to the global namespace;
  • having multiple instances of the gtest_test_info_ variable is okay, because it only ever contains nullptr, which is the sentinel value for "I'm not inside a Test."

It does still introduce a name into the test program's namespace, but I don't see any way around that. So, still kind of evil, but less evil than a true global variable.

pogo59 avatar May 05 '22 14:05 pogo59

Is there something else you want to know? I should be able to get started on unit tests for the new feature in the near future.

pogo59 avatar May 23 '22 20:05 pogo59

While investigating a failure in googletest-output-test, I've discovered that my instrumentation doesn't work well for TEST_P, TYPED_TEST, and TYPED_TEST_P. For those tests, the TestInfo* is not statically identifiable, unlike with TEST and TEST_F. (TYPED_TEST[_P] uses templates, in particular; defining and initializing a templated static member, so that the instrumentation can find it, doesn't appear to be feasible.) Associating assertion records with TestInfo is important for weeding out false positives due to filtering, skipping, sharding, and so on.

Do you think it's reasonable to say that RGT detection simply doesn't work well with parameterized tests? Or is that a blocker? If it's a blocker, I think I'd have to come up with a separate static analysis pass to associate assertions with tests, which is both a chunk more work and obviously not as user-friendly.

pogo59 avatar Jul 14 '22 15:07 pogo59

Lacking any feedback, I'll assume RGT detection would have to work with all googletest features, and my current design of adding a new "kRotten" test result isn't going to work.

While the instrumentation of the EXPECT/ASSERT macros is fine, the problem then is how to associate them with a particular TEST function. Going back to the original paper, they used a static analysis tool to do this. I think having a separate tool is not particularly user-friendly, but it should be comparatively cheap to scan source files that report un-executed assertions, and identify source ranges for each TEST function. I'll start working on that.

pogo59 avatar Aug 01 '22 15:08 pogo59