
Improve testing framework

Open SUPERCILEX opened this issue 3 years ago • 9 comments

The current tests are quite painful to work with since it's very difficult to tell what the expected values should actually be. This PR flips the testing around by passing in a sequence of expected values that the test framework will validate.

Note: depends on https://github.com/djeedai/bevy_tweening/pull/19.
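
For concreteness, here is a rough, hypothetical sketch of the shape of such a test driver; the Expected struct, check_sequence helper, and tolerance are illustrative names only, not the PR's actual code:

use std::time::Duration;

// Hypothetical snapshot of the state we expect after each tick.
struct Expected {
    progress: f32,
    times_completed: u32,
}

// Shared driver: advance the system under test one delta at a time and
// compare against the expected sequence, reporting the tick index on failure.
fn check_sequence(
    mut step: impl FnMut(Duration) -> Expected,
    delta: Duration,
    expected: &[Expected],
) {
    for (i, want) in expected.iter().enumerate() {
        let got = step(delta);
        assert_eq!(got.times_completed, want.times_completed, "tick {i}: times_completed");
        assert!((got.progress - want.progress).abs() < 1e-5, "tick {i}: progress");
    }
}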

SUPERCILEX avatar Aug 03 '22 02:08 SUPERCILEX

@djeedai This PR is ready and is already showing its value by making it super easy to see that a bunch of stuff is broken (the transforms, completion counts, progress). I didn't bother converting the sequence/track/delay tests because I'll do those in a pre-PR before rewriting the implementations.

SUPERCILEX avatar Aug 04 '22 01:08 SUPERCILEX

@djeedai This one should be ready for another round of review. It'd also be nice to get a decision on #44 since #38 will need to update Animator APIs otherwise.

SUPERCILEX avatar Aug 10 '22 01:08 SUPERCILEX

Ok I've been thinking about these tests some more and I'm realizing that I've been dumb. What I'm trying to do is replicate expect tests, but I'm forcing us to write code to generate the goldens instead of just using a golden file test. I'm going to switch to that, which should be way less code and even easier to understand.
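
For reference, a golden-file test along those lines could look roughly like this; it assumes the goldenfile crate's Mint API (the REGENERATE_GOLDENFILES variable it uses shows up in the output below), with made-up paths and values:

use goldenfile::Mint;
use std::io::Write;

#[test]
fn tween_tick_golden() {
    let mut mint = Mint::new("tests/goldens");
    let mut out = mint.new_goldenfile("tween_tick.stdout").unwrap();

    // Write whatever state the test observes; when `mint` is dropped it
    // diffs this against the checked-in golden file, and running with
    // REGENERATE_GOLDENFILES=1 rewrites the golden instead of failing.
    writeln!(out, "Progress: {}", 0.5).unwrap();
    writeln!(out, "Total completions: {}", 0).unwrap();
}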

SUPERCILEX avatar Sep 21 '22 15:09 SUPERCILEX

@djeedai Ok, WDYT?

SUPERCILEX avatar Sep 21 '22 19:09 SUPERCILEX

@djeedai Ok, WDYT?

I think I don't like goldens conceptually. I can see the value for something like a codegen tool, or some logs, or other text-heavy things. But using that to compare numerical values output as text sounds pretty contrived when you can compare the values directly. This also removes the value of having an assert!() telling you exactly where the test failed; I guess you can find it back by diff'ing the stdout, but e.g. CI probably won't tell you right away. rstest was fine, but personally I don't like this new iteration.

djeedai avatar Sep 21 '22 22:09 djeedai

This also removes the value of having an assert!() telling you exactly where the test failed

I strongly disagree here. The asserts are basically useless which is part of why I'm trying to get rid of them. Let me demonstrate with an incorrect completion count.

Here's what a failing golden test looks like:

failures:

---- tweenable::tests::tween_tick_loop_finite stdout ----

Goldenfile diff for "tween_tick_loop_finite.stdout":
To regenerate the goldenfile, run
    REGENERATE_GOLDENFILES=1 cargo test
------------------------------------------------------------
thread 'tweenable::tests::tween_tick_loop_finite' panicked at 'assertion failed: `(left == right)`'
...

Differences (-left|+right):
 1.333333332s/1s elapsed
 Direction: Forward
 State: Active
 Progress: 1.3333334
-Total completions: 0
+Total completions: 1
 Transform: Transform { translation: Vec3(1.3333334, 1.3333334, 1.3333334), rotation: Quat(0.0, 0.0, 0.0, 1.0), scale: Vec3(1.0, 1.0, 1.0) }
 Event received: TweenCompleted { entity: 42v0, user_data: 69420 }
...

The OG tests:

failures:

---- tweenable::tests::tween_tick stdout ----
TweeningType: count=Finite(1) strategy=Repeat dir=Forward
Expected: progress=0.2 factor=0.2 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.2, 0.2, 0.2)
Expected: progress=0.4 factor=0.4 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.4, 0.4, 0.4)
Expected: progress=0.6 factor=0.6 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.6, 0.6, 0.6)
Expected: progress=0.8 factor=0.8 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.8, 0.8, 0.8)
Expected: progress=1 factor=1 times_completed=0 direction=Forward state=Completed just_completed=true translation=Vec3(1.0, 1.0, 1.0)
thread 'tweenable::tests::tween_tick' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`', src/tweenable.rs:1096:21

The previous iteration of the new framework:

failures:

---- tweenable::tests::tween_tick_loop_finite stdout ----
thread 'tweenable::tests::tween_tick_loop_finite' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`: times_completed', src/tweenable.rs:1058:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

For me, the goldens are hands down the easiest to understand. The OG one requires you to look at the code to find out which value is wrong while the previous iteration clearly needs some context on the current state of the tween. That aside, I think this goes to your next point:

But using that to compare numerical values output as text sounds pretty contrived when you can compare the values directly.

I kind of agree here. But! The problem I'm trying to solve is understandability and ease of test addition. The OG tests are pretty much impossible to understand with all the indexing and branching (which also means it's impossible to add/update the tests). In the previous iteration, I think it became reasonable to understand which values were expected, but it was still hard to reason about the state context (i.e. after the 7th tick, what should my completion count be?). That meant writing the tests was still a pain because you have to come up with the math to generate the right values and it's super easy to make off-by-one errors or just goof the math. I do agree that the goldens are a bit of a weird approach, but they make writing tests effortless (just create a tween and say what your deltas are) and most importantly they're super easy to audit (they read like a story).

I generally bias towards doing whatever makes the tests easiest to write, because that will lead to the most coverage and therefore the most bugs found.


Alrighty, WDYT now? :grin: Any of that convincing?

SUPERCILEX avatar Sep 22 '22 09:09 SUPERCILEX

I like the idea and rationale to push for tests which are both easier to understand and to maintain. But I'm not convinced we need goldens for that.

I think the value argued here is that the version with goldens outputs a human understandable text output that developers can look at to get some context on the failure. On the other hand, the fact that goldens use text comparison to check correctness is not required, and probably not wanted at all. Conversely, the assert version lacks clarity for developers.

So can we have the best of both worlds? Can we output to the console the same paragraph of summary text for each step, and then right after that use assert!() to check the validity of each item? That way we retain the debugging / maintainability of the tests and avoid the text comparison. If we output anything with println!(), cargo will by default swallow the output when the test passes and only show it when the test fails (if I remember cargo correctly), which is exactly what we want. Any thoughts on this?
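
As a rough sketch of that hybrid (the helper name and arguments are made up for illustration, not code from this repo):

// Print the full step context, then assert on each quantity separately so a
// failure names the exact field. `cargo test` captures stdout by default and
// only prints it when the test fails.
fn check_step(i: usize, progress: f32, times_completed: u32, expected: (f32, u32)) {
    println!("Tick {i}: progress={progress} times_completed={times_completed}");
    assert!(
        (progress - expected.0).abs() < 1e-5,
        "tick {i}: progress (got {progress}, expected {})",
        expected.0
    );
    assert_eq!(times_completed, expected.1, "tick {i}: times_completed");
}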

djeedai avatar Sep 22 '22 12:09 djeedai

I think the value argued here is that the version with goldens outputs a human understandable text output that developers can look at to get some context on the failure.

Yes except for the last part. I've been thinking about this off and on and I've discovered the following underlying assumption motivating the golden approach: I believe writing the math to generate correct values is too difficult. That is, generating the sequence of correct values takes too long to think about (so fewer tests will get written: to put numbers on this, writing the 6 tests in dffbaec (#53) took ~1h30m) and involves too much cognitive overhead (meaning it is too difficult for someone to reason about the correctness of the test). And then the sub-assumptions there are that tests should be easy to write and easy to check for correctness. A failing assert is irrelevant in this scenario because I'm saying you don't know whether or not it should have failed. A passing test that's wrong is much scarier than a failing test. (See the appendix for code examples.)

Hence, the golden approach embraces the idea that figuring out what the correct values are is too difficult and just says, "Hey, check out what this program did!" Correctness is then determined by deciding whether or not you agree with what the program did rather than pre-determining what the program should do.

On the other hand, the fact goldens use text compare to check correctness is not required, and probably not wanted at all.

I think this misses the key difference presented above: do you want to write a test that checks what the program did or one that determines what the program should do? The argument is that "did" is much easier to write and understand (for correctness) than "should do."

So can we have the best of both worlds? [...] Any thoughts on this?

If the golden approach is abandoned, I'll definitely do this since it will make understanding the source of failures much easier, but it won't address the difficulty in writing the test or reasoning about its correctness when it's passing.


Appendix

Is the transform correct?

let tween = Tween::new(
    EaseMethod::Linear,
    Duration::from_secs(1),
    TransformPositionLens {
        start: Vec3::ZERO,
        end: Vec3::ONE,
    },
)
.with_direction(TweeningDirection::Forward)
.with_repeat_count(RepeatCount::Finite(3))
.with_repeat_strategy(RepeatStrategy::Repeat);

let expected_values = ExpectedValues {
    ...
    transforms: successors(Some(0.), |progress| Some(f32::min(3., progress + 1. / 3.)))
        .map(|progress| Transform::from_translation(Vec3::splat(progress))),
};

We both know it's not because of #42, but I would argue that's not immediately obvious. You could make it easier to see by expanding out the successors call (i.e. [0., 1./3., 2./3., ...]), but writing that test would be a massive PitA. Furthermore, it doesn't protect you from [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3] where you have no idea if the numbers start in the right places. Maybe it should be [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3], who knows. The golden shows you those values in context so you can easily reason about their correctness:

Initial tween state:
Lens=TransformPositionLens {
    start: Vec3(
        0.0,
        0.0,
        0.0,
    ),
    end: Vec3(
        1.0,
        1.0,
        1.0,
    ),
}

...

Tick by 333.333333ms:
Completion received for 42v0
1.333333332s/1s elapsed
Direction: Forward
State: Active
Progress: 1.3333334
Total completions: 1
Transform: Transform { translation: Vec3(1.3333334, 1.3333334, 1.3333334), rotation: Quat(0.0, 0.0, 0.0, 1.0), scale: Vec3(1.0, 1.0, 1.0) }
Event received: TweenCompleted { entity: 42v0, user_data: 69420 }

SUPERCILEX avatar Sep 24 '22 18:09 SUPERCILEX

That is, generating the sequence of correct values takes too long to think about (so fewer tests will get written: to put numbers on this, writing the 6 tests in dffbaec (#53) took ~1h30m) and involves too much cognitive overhead (meaning it is too difficult for someone to reason about the correctness of the test). And then the sub-assumptions there are that tests should be easy to write and easy to check for correctness. [...] Hence, the golden approach embraces the idea that figuring out what the correct values are is too difficult and just says, "Hey, check out what this program did!" Correctness is then determined by deciding whether or not you agree with what the program did rather than pre-determining what the program should do.

Fair enough, it's hard to write animation tests because there are a fair number of moving variables, and we're testing them in groups so we need to reason about all of them at once. And maybe that's the issue: we should test individual quantities (repeat count, duration, progress, etc.) in separate tests. But that doesn't mean making it easier to write those tests should take priority over having tests that are correct.

  • Text comparison for floating point values, or for any numeric value for that matter, is never guaranteed to produce anything useful by any library of any language I know (except maybe libraries explicitly trying to provide that edge-case feature, if any). So testing that floating-point values output by the library are correct by looking at their text representation is extremely brittle; there's no telling what factors influence the conversion, and certainly Rust doesn't guarantee it, even less so across compiler versions. I do not want this library to rely on such an approach; by demoting progress to a second-class citizen and using Duration everywhere we were actually moving in the right direction of correctness, where we can hope to achieve exact bit-level accuracy on playback despite the mess of floating-point inaccuracies. Goldens would move us backward, in the complete opposite direction. (See the sketch after this list for comparing the values directly instead.)
  • You said it, and we obviously have a different conclusion on whether this is wanted, but goldens mean that instead of writing a test about what you think the program does and having the test confirm that (and if not, you find a bug and fix it), you now tell the test "tell me what the program actually does", and then I'll have a look at a text file and try to figure out whether that's correct, giving a single binary yes/no answer for the entire content of a large file containing hundreds of moving data points. If anything, that sounds exactly like what you're describing you want to avoid, only worse. I simply cannot see how this can ever prevent wrong assumptions and writing incorrect tests. It's already hard enough to write an unbiased test, but if a developer now needs to analyze a 50- to 100-line file instead of focusing on the value a single assert!() should have, they're bound to make an error and assume the golden is correct when it's not.
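
To make the "compare the values directly" alternative concrete, here is a small self-contained sketch; the helper name and tolerance are arbitrary:

// Compare the numeric values with an explicit tolerance instead of comparing
// their text representation (tolerance chosen arbitrarily here).
fn assert_approx_eq(got: f32, want: f32, eps: f32) {
    assert!((got - want).abs() <= eps, "expected {want} ± {eps}, got {got}");
}

#[test]
fn three_thirds_are_one() {
    // Three f32 thirds don't have to *print* as "1" to be accepted as 1.0.
    let progress = (0..3).map(|_| 1.0_f32 / 3.0).sum::<f32>();
    assert_approx_eq(progress, 1.0, 1e-5);
}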

I think this misses the key difference presented above: do you want to write a test that checks what the program did or one that determines what the program should do? The argument is that "did" is much easier to write and understand (for correctness) than "should do."

Yes, "did" is much easier to write. It's also not what a test should do. I know you disagree with that, but I'll repeat my position which is that a test is here to confirm the developer's assumptions of how the code they wrote is supposed to behave. You have a mental model, convert it into some code, then write a test that says "here I wrote the code to achieve this result". And then you run the test and confirm that assumption. And to the extreme, with test-driven development, you actually write the test first.

I see very little value in testing what the program actually does (as opposed to what it was meant to do):

  • if it's doing what's expected, then it's actually the same as testing the expected behavior as above, so there's nothing to discuss.
  • if not, then you have either of:
    • the only real case of interest, an unexpected behavior, and here "unexpected" can only mean "differs from documentation" because the documentation is the only other source describing behavior that is shared by everyone, developers and users.
    • a documented behavior, and then that's not a bug but a feature so there's nothing to discuss.

In short, testing the actual behavior serves as a second set of documentation (you said it yourself) that necessarily conflicts with the textual docs (comments); otherwise you'd have a feature. It's also all too easy to forget about such a "wrong behavior" test in the middle of all the other tests. It encourages users looking at the test code to think of bugs as features and take a dependency on them, backing developers into a corner where the "bug" stops being a bug and can never be fixed again (looking at you, Windows API, and so many others).

Finally, I don't see where you draw the line between an "acceptable" wrong behavior and an "unacceptable" one. Is the behavior acceptable for users if the library produces "3.999999" instead of the "4.0" that the developer intended? Probably, for most users at least, even if it's not 100% accurate. What about a negative progress()? That starts to be quite awkward: most users will not expect it, since progress represents a percentage, so they probably won't handle negative values and the library will cause errors in their code. And where do we draw the line? Can we format the user's HDD when playing an animation and say "ah yes, you lost everything, but there's a test that validates that behavior so you should have expected it!"? That's obviously exaggerated, but it illustrates the point that there's no longer any rule on what the test is validating.

With intended testing, you test what the behavior should be, as imagined by whoever wrote the code. If something is found wrong in practice, you either fix it now, or if that's not possible you log a bug and comment out the test's assert!() so the rest of the test(s) continue to check the other behaviors and produce some value, and you leave the bug reference (link) in a comment. Users browsing the test code see that there's a bug, both because the assert is commented out and because there's a link to the bug, and they can follow its evolution. They can find another reference to the bug later in the CHANGELOG once it's fixed, and know the behavior has been restored.
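
A sketch of that pattern (the test body, values, and issue link are placeholders, not real code from the crate):

#[test]
fn tween_repeats_three_times() {
    // ...drive the tween; this placeholder stands in for the value the
    // library actually reports today.
    let times_completed = 2_u32;

    // Intended behavior, currently broken; disabled until the fix lands.
    // Tracking issue (hypothetical): https://github.com/djeedai/bevy_tweening/issues/NN
    // assert_eq!(times_completed, 3);

    // The rest of the test keeps checking other behaviors and still has value.
    assert!(times_completed > 0);
}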

Also, a small digression, but on that topic we should have a clean main branch with only expected behavior. The "fix it in the next PR" approach that led to #42 is a surefire way to end up with a bug at release time, because either whoever was going to fix it forgot, or whoever releases does so "too fast" without knowing the fix was due.

djeedai avatar Sep 26 '22 20:09 djeedai