cargo-mutants
Tests Continue Running for Caught Mutants Despite Early Failures
The build and tests succeed without mutations. The build with mutations also succeeds, and the tests for each mutant are then run. The test results show FAILED tests; ideally the run would stop at the first failure, but instead it keeps running all tests to completion, which makes the mutant testing take significantly longer.
Yes, this is the current behavior, and it might save a lot of time if it would stop as soon as there is any failure.
As background on how tests are run:
1. For each mutant scenario, cargo mutants invokes cargo test.
2. cargo test builds and then sequentially runs each test binary.
3. Each test binary then runs all the tests within itself.
The problem is at step 3: the standard Rust test framework does not stop early on failures, nor does it seem to have any option to make it stop early.
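For anyone checking this locally: cargo itself does have a kind of fail-fast, but it only applies between test binaries, not within one, which is why step 3 above is the sticking point. A quick way to see the difference in an ordinary crate with a failing test:

```sh
# cargo's fail-fast only decides whether *further* test binaries are launched
# after one fails; the binary that is already running still executes all of
# its own tests.
cargo test                  # default: stop launching more test binaries after a failure
cargo test --no-fail-fast   # keep launching the remaining test binaries as well
```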
There are a few options, and maybe more:
- Try to get an option accepted upstream to stop early within the test framework. This would be useful in other cases, but I have a feeling it might take a while to be in a stable release.
- Watch the output from running the test, and metaphorically ^C it when we see a failure, like you might do if you were watching it yourself. This is probably not too hard to implement, but because the test framework doesn't have a stable option for machine-readable output it would be just a heuristic match.
- Take an idea from Nextest (https://nexte.st/book/how-it-works.html) and invoke the test binary many times, each time running just one discovered test. Then we can stop as soon as we like. It's not totally trivial to implement but might be the best option. As that page notes, it may not work with custom test harnesses.
- Literally run cargo nextest, which already has a --fail-fast option, instead of cargo test. It might be good to have an option for this, but people would have to install nextest and perhaps set that option. (This and the previous option are sketched right after this list.)
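As a rough sketch of the last two options done by hand in an ordinary crate (not how cargo mutants itself would implement them; the second command assumes cargo-nextest is installed):

```sh
# Run one test at a time: ask libtest for the test names, then run them
# individually so we can stop at the first failure. As noted above, this
# won't work with custom test harnesses.
cargo test -- --list 2>/dev/null | sed -n 's/: test$//p' | while read -r t; do
    cargo test -- --exact "$t" || break   # stop as soon as one test fails
done

# Nextest: it already stops early with --fail-fast.
cargo nextest run --fail-fast
```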
I intended to ask for an implementation of running cargo nextest instead of cargo test as well, so I think that would be really useful. It is also used on the repo I'm working on and has improved performance a lot.
Adding a little extra info here that might be related to cargo test: on a large codebase, we've reliably been able to OOM a host with up to 64GB of memory. We're not sure yet why or where this happens. @ASuciuX is working on getting some relevant stats such as disk I/O, but we've also noticed the CPU (again, a very high-powered host with 32 vCPUs) regularly hitting 100% across all 32 cores.
Is the OOM the same as we're discussing in #187? If so, the short answer is: don't use high -j values. Start without -j and get that working reliably; I have not yet got your tree to pass its tests under cargo test at all...
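In concrete terms, the suggested order of operations is roughly this (a sketch, not anything cargo mutants enforces):

```sh
# 1. Make sure the unmutated tree passes its tests at all.
cargo test

# 2. Run cargo mutants without -j, so only one mutant is built and tested at
#    a time. Only add modest parallelism once that is reliable.
cargo mutants
cargo mutants -j 2
```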
I did previously have #85, which I closed due to a lack of a compelling use case, but perhaps this issue plus your tree relying on it(?) shows there is one.
Built-in one-test-at-a-time could be good too.
The tests are finicky about threads, and a bitcoind binary is required in many of them (so running multiple tests concurrently where bitcoind is triggered will fail).
I'm able to run the tests natively with RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo test --workspace -- --test-threads 1, where bitcoind is installed to /bin/bitcoind.
And I have a few nodes running cargo mutants:
- RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -f 'stackslib/src/burnchains/*.rs' -j 16 -- -- --test-threads 1
- RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -f 'stackslib/src/burnchains/*.rs' -j 24 -- -- --test-threads 1
- RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -f 'stackslib/src/burnchains/*.rs' -j 30 -- -- --test-threads 1
where each of the hosts has 32 vCPUs and an absurd amount of memory (128GB).
I can confirm that the issues you've mentioned are all related (somewhat, at least). @ASuciuX and I are working together on these tests and trying to find a performant way to execute this, with the end goal being something we can run against a diff as part of CI.
I was considering the idea you mentioned about running a single test at a time. Would something like this suffice?
RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -- -- my::test
I'm also looking at https://mutants.rs/performance.html, so I may investigate some of the options proposed there.
We've had some good experience migrating to nextest, where we can better parallelize the tests we need to run (the biggest impact we found was using nextest's partitions and archives to prevent multiple rebuilds).
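For context, that nextest workflow looks roughly like this, independent of cargo mutants (the archive file name and partition count are placeholders):

```sh
# Build the test binaries once and package them into an archive...
cargo nextest archive --archive-file tests.tar.zst

# ...then run disjoint slices of the suite from that archive on separate CI
# jobs, without rebuilding.
cargo nextest run --archive-file tests.tar.zst --partition count:1/2
cargo nextest run --archive-file tests.tar.zst --partition count:2/2
```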
I took a peek at how you're invoking cargo test, thinking it might be an interesting and simple PR, but looking over the source and considering the changes, it would take me a bit longer than just adding more CPU/memory to how we're currently invoking it.
I was able to reproduce what you commented in https://github.com/sourcefrog/cargo-mutants/issues/187#issuecomment-1858543299, which led me to try the -j flag again. I'm not sure #187 is useful, to be honest, since our tests are finicky due to the bitcoind requirement in some of them, and it's impossible to predict what -j or --test-threads values should be used on a random system (it all depends on how many CPUs you have).
One other interesting thing I found: the disk usage fluctuates wildly. I have a 100GB disk (only the host OS and the git repo), and usage swings between 600MB and 15GB. It doesn't seem to cause any issues yet, but it's curious.
I was considering the idea you mentioned about running a single test at a time. Would something like this suffice?
RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 cargo mutants -- -- my::test
That will work if my::test is enough to catch any failures in the code you're mutating; for example if you use -f to limit it to mutating one file or directory. (Really, this would be an approximation to having a smaller package whose tests are sufficient.)
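Putting those two together, a run along these lines might work (just a sketch built from the commands earlier in this thread; my::test stands in for whatever test module actually covers the mutated files):

```sh
# Mutate only the burnchains sources, and pass both a matching test filter
# and --test-threads 1 through to the test binaries.
RUST_BACKTRACE=full STACKS_LOG_DEBUG=1 BITCOIND_TEST=1 \
  cargo mutants -f 'stackslib/src/burnchains/*.rs' -- -- my::test --test-threads 1
```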
One other interesting thing I found: the disk usage fluctuates wildly.
Again, cargo mutants isn't doing much other than copying the tree itself, so the disk usage will be whatever the build produces plus whatever your tests produce. The target directory for your tree is 2.6GB on my Mac, so if you use -j6 there will be 6 copies of it, which is about 15GB. If your tests write 1GB of temp files and you run 32 tests in parallel, that's 32GB, etc. All the tempdirs are deleted at the end, so usage will go back down.
@sourcefrog, firstly, great work on cargo-mutants! It's been instrumental in our mutation testing process. I'm collaborating with @wileyj and @ASuciuX and have been closely following this discussion.
Integrating cargo-nextest for enhanced performance seems promising. Could you shed some light on the feasibility of adapting cargo-mutants to use cargo-nextest? Are there significant technical challenges we should anticipate?
Your insights and guidance would be immensely helpful.
I don't see any reason why adding Nextest support would be hard. We just need an arg/config enum saying what test driver to use, and then to thread it through to the code that runs the tests.
I'll probably do it in the next month or two, but if you want to do it let me know.
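As a very rough sketch of that shape (hypothetical names, not the actual cargo-mutants internals):

```rust
/// Hypothetical sketch: which driver to use for each mutant's test run.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum TestTool {
    #[default]
    CargoTest,
    Nextest,
}

/// Build the cargo subcommand arguments for the chosen driver.
fn cargo_args(tool: TestTool) -> Vec<&'static str> {
    match tool {
        TestTool::CargoTest => vec!["test"],
        // nextest already stops at the first failure with --fail-fast.
        TestTool::Nextest => vec!["nextest", "run", "--fail-fast"],
    }
}

fn main() {
    assert_eq!(cargo_args(TestTool::Nextest), ["nextest", "run", "--fail-fast"]);
}
```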
Let's take discussion of implementing Nextest to the more specific issue #85