
Move to patch testing

Open • white238 opened this issue on Jul 26 '22 • 1 comment

@samuelpmishLLNL pointed out that there is a better way to handle some of the unit tests we are writing. We should create a few examples and document how to write tests this way so that future tests can be done in a more robust manner.

Someone who understands it needs to flesh this out further.

white238 • Jul 26 '22 23:07

Here's a review of what we're trying to test, some of the weird practices taking place in serac's test suite, and some suggestions on how to improve.

In a nutshell, we have some equations that come from the physics, $r_1(x,y) = 0, r_2(x,y) = 0$, that we'd like to solve:

[figure: exact_solution]

We find an approximate solution by making an initial guess and iterating until the norm of $r$ is smaller than some tolerance (i.e. until the point gets inside the red disk around the solution, whose radius corresponds roughly to the tolerance), and then stopping.

[figure: iterate]
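For concreteness, here is a minimal sketch of that kind of solve loop in C++, using a made-up 2D residual and a hand-rolled Newton step. None of this is serac code; the residual, Jacobian, and function names are invented purely to illustrate the "iterate until the norm of $r$ is small" idea.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

// Toy residual r(x, y) = (r1, r2); a stand-in for the real physics equations.
std::array<double, 2> residual(double x, double y)
{
  return {x * x + y - 2.0,   // r1(x, y)
          x - y * y + 1.0};  // r2(x, y)
}

double norm(const std::array<double, 2>& r) { return std::hypot(r[0], r[1]); }

// Newton iteration: update the guess until ||r|| drops below the tolerance.
std::array<double, 2> solve(double x, double y, double tolerance)
{
  auto r = residual(x, y);
  for (int k = 0; k < 50 && norm(r) > tolerance; ++k) {
    // Jacobian of the toy residual, inverted directly since it is only 2x2.
    const double J[2][2] = {{2.0 * x, 1.0}, {1.0, -2.0 * y}};
    const double det     = J[0][0] * J[1][1] - J[0][1] * J[1][0];
    // (x, y) <- (x, y) - J^{-1} r
    x -= ( J[1][1] * r[0] - J[0][1] * r[1]) / det;
    y -= (-J[1][0] * r[0] + J[0][0] * r[1]) / det;
    r = residual(x, y);
  }
  return {x, y};  // approximate solution, inside the region where ||r|| < tolerance
}

int main()
{
  auto [x, y] = solve(1.0, 1.0, 1.0e-12);
  std::printf("x = %.15f, y = %.15f\n", x, y);
}
```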

Up to this point, this is all normal. The unusual part is that our tests don't actually check that the output is close to the solution. Instead, they check that the norm of the output is close to the norm of the accepted answer. That is, any point in the purple ring below will pass the `EXPECT_NEAR(norm(accepted_solution), norm(approximate_solution), tolerance)` check.

[figure: expect_near_1]

This is bad because practically all of those points in the purple ring are clearly not even close to the solution, but our test suite would classify them all as correct, which gives the developer the wrong impression that everything is working. For example, an incorrect implementation of those residual equations (represented by dashed lines) might still be inside that purple ring:

[figure: expect_near_2]
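To make that failure mode concrete, here is a hypothetical gtest snippet (the vectors, test name, helper, and tolerance are all invented for illustration) in which a clearly wrong answer sails through the norm-of-norms check:

```cpp
#include <gtest/gtest.h>

#include <cmath>
#include <vector>

// Euclidean norm of a small vector (illustration only).
double norm(const std::vector<double>& v)
{
  double sum = 0.0;
  for (double vi : v) sum += vi * vi;
  return std::sqrt(sum);
}

TEST(PatchTest, NormOfNormsIsTooLoose)
{
  const std::vector<double> accepted_solution{1.0, 0.0};      // hypothetical baseline
  const std::vector<double> approximate_solution{0.0, 1.0};   // a completely different point

  // Both vectors have norm 1, so this check passes even though the
  // "approximate solution" is a distance of sqrt(2) away from the baseline.
  EXPECT_NEAR(norm(accepted_solution), norm(approximate_solution), 1.0e-12);
}
```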

Another bad practice is that we continue to define the "accepted solution" as the output of the code the first time the test is run. This is especially bad because the initial implementation of practically anything will inevitably have bugs, so defining a baseline from the output of an early implementation practically ensures that the baselines will be incorrect, which makes everything really confusing. For example, if the flawed implementation (dashed) was used to set a baseline:

[figure: expect_near_3]

Then, if you come back later and actually fix the implementation, a correct solution will result in a test failure.


So, what should we do?

  1. Don't write (and don't approve PRs that do):

     `EXPECT_NEAR(norm(accepted_solution), norm(approximate_solution), tolerance)`

     Instead, do:

     `EXPECT_NEAR(0, norm(accepted_solution - approximate_solution), tolerance)`

  2. Don't just assume that the first output from your code is correct. It likely isn't, and it does not make for a suitable baseline. Prefer baselines that come from analytic solutions, manufactured solutions, or other trusted codes. If you aren't confident that the baseline is right, then what is the point of having the test in the first place?

  3. Prefer relative errors, when possible. Absolute errors depend on a lot of arbitrary information (mesh size, specific material parameters, etc.) that makes it difficult to understand scale. 1.0e-12 might be a really tight or a really loose absolute tolerance, depending on what units you use. Taking a ratio cancels out the arbitrary choice of units, and nondimensional tolerances are less error-prone, since they don't require the developer to have a detailed understanding of the specific problem. A sketch combining these three points follows the list.
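Putting the three recommendations together, a corrected test might look something like the sketch below: compare against a trusted baseline (here assumed to come from an analytic or manufactured solution), take the norm of the difference, and divide by the norm of the exact solution to get a nondimensional tolerance. The helper functions, test name, and numeric values are placeholders, not serac APIs.

```cpp
#include <gtest/gtest.h>

#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm of a small vector (illustration only).
double norm(const std::vector<double>& v)
{
  double sum = 0.0;
  for (double vi : v) sum += vi * vi;
  return std::sqrt(sum);
}

// Componentwise difference of two equally sized vectors.
std::vector<double> difference(const std::vector<double>& a, const std::vector<double>& b)
{
  std::vector<double> d(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) d[i] = a[i] - b[i];
  return d;
}

TEST(PatchTest, RelativeErrorAgainstAnalyticSolution)
{
  // Baseline from an analytic / manufactured solution, not from a previous run of the code.
  const std::vector<double> exact_solution{0.5, 1.25};         // placeholder values
  const std::vector<double> approximate_solution{0.5, 1.25};   // stand-in for the solver output

  // Nondimensional (relative) tolerance: meaningful regardless of units or mesh size.
  const double relative_tolerance = 1.0e-10;

  const double relative_error =
      norm(difference(exact_solution, approximate_solution)) / norm(exact_solution);
  EXPECT_LT(relative_error, relative_tolerance);
}
```

With this form, a rotated or otherwise wrong answer produces a relative error of order one and fails immediately, and the same tolerance stays meaningful no matter what units, mesh, or material parameters the problem uses.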

samuelpmishLLNL • Jul 27 '22 05:07