cargo-mutants icon indicating copy to clipboard operation
cargo-mutants copied to clipboard

Tests Continue Running for Caught Mutants despite early Failures

Open ASuciuX opened this issue 1 year ago • 10 comments

The build and tests succeed without mutations. Following this, the build with mutations also succeeds, and tests for mutants are executed. The test results show instances of FAILED tests. Ideally, the process should stop at the first failure, but it persists in running all tests to completion, resulting in a significantly longer duration to complete the mutant testing.

ASuciuX avatar Dec 15 '23 01:12 ASuciuX

Yes, this is the current behavior, and it might save a lot of time if it would stop as soon as there is any failure.

As background on how tests are run:

  1. For each mutant scenario, cargo mutants invokes cargo test
  2. cargo test builds and then sequentially runs each test binary
  3. each test binary then runs all the tests within itself

The problem is at step 3: the standard Rust test framework does not stop early on failures, nor does it seem to have any option to make it stop early.

There are a few options, and maybe more:

  1. Try to get an option accepted upstream to stop early within the test framework. This would be useful in other cases, but I have a feeling it might take a while to be in a stable release.
  2. Watch the output from running the test, and metaphorically ^C it when we see a failure, like you might do if you were watching yourself. This is probably not too hard to implement, but because the test framework doesn't have a stable option for machine-readable output it would be just a heuristic match.
  3. Take an idea from Nextest (https://nexte.st/book/how-it-works.html) and invoke the test binary many times, each time running just one discovered test. Then we can stop as soon as we like. It's not totally trivial to implement but might be the best option. As that page notes, it may not work with custom test harnesses.
  4. Literally run cargo nextest, which already has a --fail-fast option, instead of cargo test. It might be good to have an option for it, but people would have to install it and perhaps set that option.

sourcefrog avatar Dec 15 '23 09:12 sourcefrog

I intended to ask for an implementation of running cargo nextest instead of cargo test as well. So, I think that would be really useful. Also, it is used on the repo I'm working on as well and has increased performance by a lot.

ASuciuX avatar Dec 15 '23 14:12 ASuciuX

Yes, this is the current behavior, and it might save a lot of time if it would stop as soon as there is any failure.

As background on how tests are run:

  1. For each mutant scenario, cargo mutants invokes cargo test
  2. cargo test builds and then sequentially runs each test binary
  3. each test binary then runs all the tests within itself

The problem is at step 3: the standard Rust test framework does not stop early on failures, nor does it seem to have any option to make it stop early.

There are a few options, and maybe more:

  1. Try to get an option accepted upstream to stop early within the test framework. This would be useful in other cases, but I have a feeling it might take a while to be in a stable release.
  2. Watch the output from running the test, and metaphorically ^C it when we see a failure, like you might do if you were watching yourself. This is probably not too hard to implement, but because the test framework doesn't have a stable option for machine-readable output it would be just a heuristic match.
  3. Take an idea from Nextest (https://nexte.st/book/how-it-works.html) and invoke the test binary many times, each time running just one discovered test. Then we can stop as soon as we like. It's not totally trivial to implement but might be the best option. As that page notes, it may not work with custom test harnesses.
  4. Literally run cargo nextest, which already has a --fail-fast option, instead of cargo test. It might be good to have an option for it, but people would have to install it and perhaps set that option.

adding a little extra info here that might be related to cargo test - on a large codebase, we've reliably been able to OOM a host with up tp 64GB of memory. we're not sure yet why/where this happens. @ASuciuX is working on getting some relevant stats like I/O of the disk - but we've also noticed the CPU (again, very high powered host with 32vCPU) hitting 100% across all 32 cores regularly.

wileyj avatar Dec 15 '23 16:12 wileyj

Is the OOM the same as we're discussing in #187? If so the short answer is don't use high -j values, start without -j and get that working reliably: I have not yet got your tree to pass its tests under cargo test at all...

I did previously have #85, which I closed due to a lack of a compelling use case, but perhaps this issue plus your tree relying on it(?) shows there is one.

Built-in one-test-at-a-time could be good too.

sourcefrog avatar Dec 15 '23 20:12 sourcefrog

Is the OOM the same as we're discussing in #187? If so the short answer is don't use high -j values, start without -j and get that working reliably: I have not yet got your tree to pass its tests under cargo test at all...

I did previously have #85, which I closed due to a lack of a compelling use case, but perhaps this issue plus your tree relying on it(?) shows there is one.

Built-in one-test-at-a-time could be good too.

the threads are finicky - and a bitcoind binary is required in many of the tests (so running multiple tests concurrently where bitcoin is triggered for a test will fail). i'm able to run tests natively with RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo test --workspace -- --test-threads 1, where bitcoind is installed to /bin/bitcoind.

and i have a few nodes running cargo mutants:

  • RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -f 'stackslib/src/burnchains/*.rs' -j 16 -- -- --test-threads 1
  • RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -f 'stackslib/src/burnchains/*.rs' -j 24 -- -- --test-threads 1
  • RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -f 'stackslib/src/burnchains/*.rs' -j 30 -- -- --test-threads 1

where each of the hosts has 32 vCPU and an absurd amount of memory (128GB).

i can confirm that the issues you've mentioned are all related (somewhat at least), @ASuciuX and i are working together on these tests and trying to find a performant way to execute this, with the end goal being something we can run against a diff as part of CI.

i was considering the idea you mentioned about running a single test at a time. would something like this suffice?

RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -- -- my::test

i'm also seeing this as well: https://mutants.rs/performance.html , so i may investigate some of the options proposed here.

wileyj avatar Dec 15 '23 23:12 wileyj

we've had some good experience migrating to nextest where we can better parallelize the tests we need to run (the biggest impact we found was using nextest's partitions and archives to prevent multiple rebuilds).

i took a peek at how you're invoking cargo test, thinking it may be an interesting and simple PR - but looking over the source and considering the changes, it would take me a bit longer than just adding more cpu/mem to how we're currently invoking.

i was able to reproduce what you commented in https://github.com/sourcefrog/cargo-mutants/issues/187#issuecomment-1858543299, which led me to try the -j flag again. i'm not sure #187 is useful to be honest, since our tests are finicky due the bitcoind requirement in some tests - and it's impossible to predict what -j or --test-threads should be used on a random system (indeed, it's all dependent on how many cpu's you have).

wileyj avatar Dec 16 '23 00:12 wileyj

One other thing i found that is interesting - the disk usage fluctuates wildly. i have a 100GB disk (only host OS and git repo), and the usage diff between 600MB and 15GB. it doesn't seem to cause any issues yet, but it's curious.

wileyj avatar Dec 16 '23 00:12 wileyj

i was considering the idea you mentioned about running a single test at a time. would something like this suffice?

RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -- -- my::test

That will work if my::test is enough to catch any failures in the code you're mutating; for example if you use -f to limit it to mutating one file or directory. (Really, this would be an approximation to having a smaller package whose tests are sufficient.)

One other thing i found that is interesting - the disk usage fluctuates wildly.

Again, cargo mutants isn't doing much other than copying the tree itself, so the disk usage will be whatever the build produces plus whatever your test produces. The target for your tree is 2.6GB on my mac so if you use -j6 there will be 6 copies of it and that would be 15GB. If your tests write a 1GB of temp files and your run 32 tests in parallel that's 32GB, etc. All the tempdirs are deleted at the end, so usage will go back down.

sourcefrog avatar Dec 16 '23 01:12 sourcefrog

@sourcefrog, firstly, great work on cargo-mutants! It's been instrumental in our mutation testing process. I'm collaborating with @wileyj and @ASuciuX and have been closely following this discussion.

Integrating cargo-nextest for enhanced performance seems promising. Could you shed some light on the feasibility of adapting cargo-mutants to use cargo-nextest? Are there significant technical challenges we should anticipate?

Your insights and guidance would be immensely helpful.

moodmosaic avatar Dec 28 '23 16:12 moodmosaic

I don't see any reason why adding Nextest support would be hard. We just need an arg/config enum saying what test driver to use, and then to thread it through to the code that runs the tests.

I'll probably do it in the next month or two, but if you want to do it let me know.

Let's take discussion of implementing Nextest to the more specific issue #85

sourcefrog avatar Dec 28 '23 20:12 sourcefrog