feat(forge): coverage guided fuzzing & time based campaigns for invariant mode
Motivation
Using the inspector added in https://github.com/paradigmxyz/revm-inspectors/pull/255, this PR adds a corpus, which is enabled by setting a folder path for the corpus_dir invariant config. A corpus size of min_corpus_size is targeted once all entries have been mutated min_corpus_mutations times, with one caveat: if an entry is highly likely to produce new finds (likelihood greater than one-third, a guesstimate), it remains a candidate for mutation. Currently, there are 5 mutations: splice combines two sequences, interleave weaves two sequences together, prefix overwrites the beginning of a sequence, suffix overwrites the end of a sequence, and mutate args selects a subset of arguments from a call in the sequence to modify.
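To make the five mutations concrete, here is a minimal sketch of what each one does to a call sequence. It is not the code in this PR: the `Call` type, the function names, and the explicit cut points and indices (which a real fuzzer would draw from its RNG) are all assumptions for illustration.

```rust
/// Hypothetical stand-in for one call in an invariant sequence.
#[derive(Clone, Debug, PartialEq)]
struct Call {
    selector: [u8; 4],
    args: Vec<u64>,
}

/// Splice: keep `a` up to `cut_a`, then continue with `b` from `cut_b`.
fn splice<T: Clone>(a: &[T], b: &[T], cut_a: usize, cut_b: usize) -> Vec<T> {
    let cut_a = cut_a.min(a.len());
    let cut_b = cut_b.min(b.len());
    let mut out = a[..cut_a].to_vec();
    out.extend_from_slice(&b[cut_b..]);
    out
}

/// Interleave: alternate items from both sequences; the longer tail is appended.
fn interleave<T: Clone>(a: &[T], b: &[T]) -> Vec<T> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut ia, mut ib) = (a.iter(), b.iter());
    loop {
        match (ia.next(), ib.next()) {
            (None, None) => break,
            (x, y) => {
                out.extend(x.cloned());
                out.extend(y.cloned());
            }
        }
    }
    out
}

/// Prefix: overwrite the first `n` items of `target` with the first `n` of `donor`.
fn overwrite_prefix<T: Clone>(target: &mut [T], donor: &[T], n: usize) {
    let n = n.min(target.len()).min(donor.len());
    target[..n].clone_from_slice(&donor[..n]);
}

/// Suffix: overwrite the last `n` items of `target` with the last `n` of `donor`.
fn overwrite_suffix<T: Clone>(target: &mut [T], donor: &[T], n: usize) {
    let n = n.min(target.len()).min(donor.len());
    let (t, d) = (target.len() - n, donor.len() - n);
    target[t..].clone_from_slice(&donor[d..]);
}

/// Mutate args: apply `f` to the chosen subset of a call's arguments.
/// Out-of-range indices are ignored.
fn mutate_args(call: &mut Call, picked: &[usize], f: impl Fn(u64) -> u64) {
    for &i in picked {
        if let Some(arg) = call.args.get_mut(i) {
            *arg = f(*arg);
        }
    }
}
```

The sequence-level mutations are generic over the item type, so the same helpers work on any representation of a call; only `mutate_args` needs to know about the call's structure.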
Closes https://github.com/foundry-rs/foundry/issues/8665
Closes https://github.com/foundry-rs/foundry/issues/990 - if a timeout (in seconds) is configured, the campaign loops until it expires instead of stopping at max runs
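The difference between a run-bounded and a time-bounded campaign can be sketched as a single loop with two stop conditions. This is only an illustration under assumed names (`Budget`, `run_campaign`), not the runner's actual code.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stop condition for a campaign.
enum Budget {
    /// Stop after a fixed number of runs.
    Runs(u32),
    /// Keep going until the wall-clock limit expires.
    Timeout(Duration),
}

/// Executes `run_once` until the budget is exhausted and returns the number
/// of completed runs. The timeout is checked between runs, so a run that is
/// already in progress is allowed to finish.
fn run_campaign(budget: Budget, mut run_once: impl FnMut(u32)) -> u32 {
    let start = Instant::now();
    let mut runs = 0u32;
    loop {
        let keep_going = match &budget {
            Budget::Runs(max) => runs < *max,
            Budget::Timeout(limit) => start.elapsed() < *limit,
        };
        if !keep_going {
            break;
        }
        run_once(runs);
        runs += 1;
    }
    runs
}
```

With `Budget::Timeout(Duration::from_secs(1800))` the loop mirrors a `timeout = 1800` campaign; the number of runs is then an output of the campaign rather than an input.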
TODO
- [x] Replay corpus at start up so the hitmap is initialized and not duplicating entries
- [x] LRU cache for corpus that flushes to disk when a seed hasn't been selected for mutation in a while. This will be random for now as we don't have any notion of seed scheduling/priority, so it could just be a fixed-size data structure that writes to disk when an entry is evicted without the LRU component.
- [x] Add mutation to tx sequence that modifies the ABI values and re-encodes them
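The "fixed-size data structure that writes to disk when an entry is evicted" could look roughly like the sketch below. The type name, the `<id>.bin` file naming, and the caller-supplied `victim_hint` (standing in for a random pick) are assumptions, not what this PR implements.

```rust
use std::{fs, io, path::PathBuf};

/// Hypothetical fixed-size in-memory corpus that flushes evicted entries to disk.
struct BoundedCorpus {
    dir: PathBuf,
    capacity: usize,
    /// (entry id, serialized call sequence)
    entries: Vec<(u64, Vec<u8>)>,
    next_id: u64,
}

impl BoundedCorpus {
    fn new(dir: PathBuf, capacity: usize) -> io::Result<Self> {
        assert!(capacity > 0, "capacity must be non-zero");
        fs::create_dir_all(&dir)?;
        Ok(Self { dir, capacity, entries: Vec::with_capacity(capacity), next_id: 0 })
    }

    /// Number of entries currently held in memory.
    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Adds an entry. When the corpus is full, the entry at
    /// `victim_hint % capacity` is written to `<dir>/<id>.bin` and dropped
    /// from memory first; the path of the flushed file is returned.
    fn insert(&mut self, bytes: Vec<u8>, victim_hint: usize) -> io::Result<Option<PathBuf>> {
        let mut flushed = None;
        if self.entries.len() == self.capacity {
            let (id, old) = self.entries.swap_remove(victim_hint % self.capacity);
            let path = self.dir.join(format!("{id}.bin"));
            fs::write(&path, old)?;
            flushed = Some(path);
        }
        self.entries.push((self.next_id, bytes));
        self.next_id += 1;
        Ok(flushed)
    }
}
```

Swapping the random victim for a least-recently-selected one later would only change how `victim_hint` is chosen, which is why the LRU component can be deferred.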
These aren't strictly blocking but definitely near-term priorities to facilitate long-running fuzzing campaigns:
- [ ] Headless fuzzing mode with structured logging that includes cumulative run info: num of edges seen, number of failures encountered, number of sequences tested, size of in-memory corpus
- [ ] https://github.com/foundry-rs/foundry/issues/9727
- [x] https://github.com/foundry-rs/foundry/issues/990
- [ ] Support stateless fuzzing with mutations limited to ABI args
- [ ] Figure out sharing sequences between each invariant's corpus, as rn they are independent. I think calling all invariants in every thread is probably the right choice here, but they need to have the same config and target/excludes
- [ ] https://github.com/foundry-rs/foundry/issues/8898
Future work
- [ ] Coverage evaluation suite. Run and collect edge coverage over time (probably at least 4 hours and repeated to account for randomness). Produce LCOV report and consider lines covered by Foundry w/o coverage, Echidna, Medusa but not by Foundry w/ coverage.
- [ ] Add decoded ABI signature to corpus
- [ ] Seed corpus from unit test call sequences
Performance
Requires more investigation so jotting ideas I thought of
- [ ] Make corpus lookup more efficient https://github.com/foundry-rs/foundry/pull/10190#discussion_r2114069107
- [ ] Experiment with different edge sizes and maybe do something with shared memory
- [ ] SIMD coverage map like with LibAFL
Solution
PR Checklist
- [ ] Added Tests
- [ ] Added Documentation
- [ ] Breaking changes
https://github.com/foundry-rs/foundry/pull/10190/commits/b7f09d836c55e12cf6c721bf2bdb789ab86c5843 operated under the assumption that the libafl function updated the history map, but it doesn't. So we either need to revert back to the other way or iterate over the history map and set every index to the "classified" hitcount when there's new coverage
I think reverting and using the other way should work here, do you see any big disadvantages using one or another?
Mainly perf considerations... But it can wait and we avoid taking a dep on libafl
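For context on the "classified" hitcount option discussed above, here is a minimal AFL-style sketch: raw per-edge hit counts are bucketed, and the history map is updated whenever a bucket shows up for the first time. Recording seen buckets as OR-ed bits is an assumption of this sketch; it is neither libafl's API nor necessarily what this PR ends up doing.

```rust
/// Buckets a raw hit count the way AFL-style fuzzers do, so that small
/// changes in loop counts do not register as new coverage.
fn classify(hits: u8) -> u8 {
    match hits {
        0 => 0,
        1 => 1,
        2 => 2,
        3 => 4,
        4..=7 => 8,
        8..=15 => 16,
        16..=31 => 32,
        32..=127 => 64,
        128..=255 => 128,
    }
}

/// Merges the current run's edge map into the history map and reports
/// whether any edge reached a bucket that was not seen before.
/// `history` stores the OR of all buckets seen so far for each edge.
fn merge_into_history(history: &mut [u8], current: &[u8]) -> bool {
    assert_eq!(history.len(), current.len(), "maps must have the same size");
    let mut new_coverage = false;
    for (seen, &hits) in history.iter_mut().zip(current) {
        let bucket = classify(hits);
        if bucket & !*seen != 0 {
            *seen |= bucket;
            new_coverage = true;
        }
    }
    new_coverage
}
```

Because the update happens in the same pass as the comparison, the history map never goes stale, which was the problem with assuming the library call updated it.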
added time based invariant campaigns in https://github.com/foundry-rs/foundry/pull/10190/commits/08e501aae466372947858d02dc2c9850d12d32e4
@DaniPopes could you pls check the addressed changes and whether this is good to merge? Thank you
Have we benchmarked this?
It's still TBD, wanted to have it out for security researchers to start using it and provide feedback. If you strongly feel we shouldn't until we get all benchmarks then can work on it before merging, lmk. Thank you!
Well we're replacing the invariant runner by default with no fallback, I would at least check if we haven't had any regressions in performance and behavior
Fair enough. For behavior, I feel like we have many tests in CI to make sure it hasn't changed; for performance I'll compare at minimum with the ones used for the v1.0.0 benchmarks: https://github.com/devdacian/solidity-fuzzing-comparison https://github.com/grandizzy/fuzz-benchmarks. Will look for others and update the ticket with results.
Other tests suggested by @0xalpharush:
- we can run some of these for a couple hours and hopefully that gives some signal https://github.com/morpho-org/morpho-blue/tree/main/test/forge/invariant https://github.com/bgd-labs/aave-v3-origin/tree/v3.3.0/tests/invariants
- collect lcov and compare using https://github.com/capgelka/lcov-diff to check effectiveness
@0xkarmacoma I know you've been running some tests https://x.com/0xkarmacoma/status/1889771304787796198 if you still have the setup would you be willing to compare with this PR too? thanks, lmk!
I thought this would not be the default behavior as a user must add some configs to their foundry.toml to enable it
@0xalpharush ah, do you mean we should keep the old behavior if let's say no corpus dir is set? rn is not doing this but can be easily done
@grandizzy I don't have time to run the comparison rn, but I was using https://github.com/ConsenSysDiligence/daedaluzz/blob/master/run-foundry.sh
Regression tests
https://github.com/grandizzy/fuzz-benchmarks
- no regression in breaking invariants
- `SimpleDSChiefTest` can be caught faster with coverage guided fuzzing (as it requires same sequence mutated to break it, e.g. it constantly breaks it in 21500 calls vs 450500 calls)
- `ConstantsBytes32Test` is now finding the counterexample, whereas prev was marked as failing
- `CreateTest` does not require special config anymore
https://github.com/devdacian/solidity-fuzzing-comparison
| Test Case | cov guided | no cov guided | v1.2.3 | v1.0.0 |
|---|---|---|---|---|
| NaiveReceiverAdvancedFoundry | 9.08ms | 11.62ms | 13.08ms | 12.66ms |
| NaiveReceiverBasicFoundry | 2.37s | 686.77ms | 675.42ms | 674.38ms |
| UnstoppableBasicFoundry | 3.85s | 3.91s | 1.55s | 1.71s |
| ProposalCryticTesterToFoundry | 3.79ms | 16.47ms | 14.87ms | 11.00ms |
| VotingNftCryticToFoundry | 13.43s | 4.32s | 4.32s | 4.51s |
| TokenSaleBasicFoundry | 19.29s | 7.66s | 7.61s | 7.95s |
https://github.com/ConsenSysDiligence/daedaluzz
- 30-minute runs for mazes 0, 1 and 2; 10-minute runs for mazes 3 and 4; all on 16 cores
- values show the number of broken invariants and which ones, for this PR (coverage guided, not coverage guided), v1.2.3 and v1.0.0
- running without coverage guided fuzzing finds the same scenarios as v1.2.3 and v1.0.0
- running with coverage guided fuzzing finds fewer scenarios for maze 3 (expected, as coverage guided fuzzing should perform better on longer runs and maze 3 was run for only 10 minutes)
| Maze | PR coverage guided | PR not guided | v1.2.3 | v1.0.0 |
|---|---|---|---|---|
| 0 | 13 (invariants 16, 17, 18, 19, 27, 29, 30, 34, 35, 38, 42, 5, 9) | 13 (invariants 16, 17, 18, 19, 27, 28, 29, 30, 34, 35, 42, 5, 9) | 13 (invariants 16, 17, 18, 19, 27, 29, 30, 34, 35, 38, 42, 5, 9) | 14 (invariants 16, 17, 18, 19, 27, 28, 29, 30, 34, 35, 38, 42, 5, 9) |
| 1 | 14 (invariants 12, 16, 17, 25, 26, 31, 32, 33, 38, 39, 47, 48, 5, 7) | 13 (invariants 12, 16, 17, 25, 26, 31, 32, 38, 39, 47, 48, 5, 7) | 14 (invariants 12, 16, 17, 25, 26, 31, 33, 38, 39, 46, 47, 48, 5, 7) | 13 (invariants 12, 16, 17, 25, 26, 31, 33, 38, 39, 47, 48, 5, 7) |
| 2 | 15 (invariants 14, 2, 26, 27, 29, 31, 33, 38, 39, 4, 40, 42, 43, 44, 46) | 15 (invariants 14, 2, 26, 27, 29, 31, 33, 38, 39, 4, 40, 42, 43, 44, 46) | 16 (invariants 13, 14, 2, 26, 27, 29, 31, 33, 38, 39, 4, 40, 42, 43, 44, 46) | 15 (invariants 14, 2, 26, 27, 29, 31, 33, 38, 39, 4, 40, 42, 43, 44, 46) |
| 3 | 10 (invariants 12, 16, 25, 3, 33, 36, 40, 41, 8, 9) | 13 (invariants 10, 11, 12, 16, 25, 3, 33, 36, 40, 41, 6, 8, 9) | 12 (invariants 10, 12, 16, 25, 3, 33, 36, 40, 41, 6, 8, 9) | 13 (invariants 1, 10, 12, 16, 25, 3, 33, 36, 40, 41, 6, 8, 9) |
| 4 | 15 (invariants 10, 11, 17, 22, 27, 28, 32, 33, 36, 37, 41, 43, 5, 6) | 15 (invariants 10, 11, 17, 22, 27, 28, 32, 33, 36, 37, 41, 43, 5, 6) | 15 (invariants 10, 11, 17, 22, 27, 28, 32, 33, 36, 37, 41, 43, 5, 6) | 14 (invariants 10, 11, 17, 22, 27, 28, 32, 33, 36, 37, 41, 43, 5) |
Coverage
https://github.com/morpho-org/morpho-blue
- collected coverage from a 30-minute time-based campaign, with and without coverage guidance, using `forge coverage --mt invariant --report lcov --show-progress`
- same coverage results (using lcov-diff), though new coverage corpus was still generated: this could indicate that a longer time-based campaign is needed in order to see differences
- the campaign resulted in more runs when not guided (less time spent mutating the corpus and writing to disk)
- coverage guided config

      [profile.default.invariant]
      depth = 100
      timeout = 1800
      corpus_dir = "corpus/"
      corpus_min_size = 100
      fail_on_revert = true

- not guided config

      [profile.default.invariant]
      depth = 100
      timeout = 1800
      fail_on_revert = true

| Invariant | PR coverage guided | PR not guided |
|---|---|---|
| invariantBadDebt | runs: 16860, calls: 1686000 | runs: 18050, calls: 1805000 |
| invariantBorrowShares | runs: 16435, calls: 1643500 | runs: 17152, calls: 1715200 |
| invariantMorphoBalance | runs: 16730, calls: 1673000 | runs: 17605, calls: 1760500 |
| invariantSupplyShares | runs: 12827, calls: 1282700 | runs: 13286, calls: 1328600 |
| invariantTotalSupplyGeTotalBorrow | runs: 18312, calls: 1831200 | runs: 19523, calls: 1952300 |
| invariantHealthy | runs: 11349, calls: 1134900 | runs: 12027, calls: 1202700 |
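A quick way to read the table: with `depth = 100`, every run executes 100 calls, so calls = runs × 100, and the guided campaign's overhead shows up as a shortfall in runs. For invariantBadDebt that is (18050 - 16860) / 18050, roughly 6.6% fewer runs in the same 30 minutes. The helper names below are hypothetical and only restate that arithmetic.

```rust
/// Total calls executed by a campaign: every run performs `depth` calls.
fn total_calls(runs: u64, depth: u64) -> u64 {
    runs * depth
}

/// Percentage of runs the guided campaign fell short by, relative to the
/// unguided campaign over the same time budget.
fn run_shortfall_percent(guided_runs: u64, unguided_runs: u64) -> f64 {
    (unguided_runs as f64 - guided_runs as f64) / unguided_runs as f64 * 100.0
}
```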
@0xalpharush @DaniPopes please see above https://github.com/foundry-rs/foundry/pull/10190#issuecomment-2980012506 Couple of things noticed while performing tests:
- No regression introduced, using defaults (coverage guided disabled) the invariant tests behave as in previous versions
- No immediate gain seen on morpho blue, but this was a time-based campaign of only 30 minutes and new coverage was still being produced / written to disk. The feature needs extensive testing with different code bases
- Coverage guided fuzzing is suitable for longer campaigns. It is also more efficient on scenarios that need a sequence of calls in a specific order (like the DSChief bug), where all runs produce counterexamples. Running with the defaults / an older version can sometimes miss producing a counterexample and needs longer runs to break the invariant.
I think there's a lot to be improved, but this unblocks running the fuzzer with coverage for hours on end and resuming after a restart without starting from scratch, as the corpus is cumulative.
I would merge and then work on these so we could collect analytics:
- Headless fuzzing mode with structured logging that includes cumulative run info: num of edges seen, number of failures encountered, number of sequences tested, size of in-memory corpus
- https://github.com/foundry-rs/foundry/issues/9727
yep, agree, @DaniPopes good to send?