`new_edge` and `new_features` stats may be incorrect (e.g. when `-max_len` or `-use_value_profile` is used)
- If we fuzz with some `-max_len` value, the `initial_edge_coverage` stat will not represent the actual edge coverage of the full corpus, because libFuzzer will ignore the contents past the first `max_len` bytes in every input.
- During the merge, we reasonably remove the `-max_len` argument, in order to avoid truncating the corpus and losing the coverage we have.
- The resulting `edge_coverage` is extracted from the merge log and might be greater than the `initial_edge_coverage` even if we didn't generate any inputs, just because `-max_len` caused the initial coverage to be incomplete.
- The resulting `new_edge` metric is calculated as the difference between these two (excluding the corpus subset strategy, where we noticed the very same problem and fixed it some time ago): https://github.com/google/clusterfuzz/blob/649d43cbd729acfd8a6342ad078b23fd191b9869/src/python/bot/fuzzers/libFuzzer/stats.py#L291
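The problematic calculation described above can be sketched as follows (a simplified illustration; names are hypothetical, the real code is in the `stats.py` file linked above):

```python
def compute_new_edges(initial_edge_coverage, merge_edge_coverage):
    """Sketch of the new_edge computation described above.

    If the fuzzing run used -max_len, initial_edge_coverage undercounts the
    corpus (inputs are read truncated), while the merge step runs without
    -max_len, so merge_edge_coverage can exceed it even when no genuinely
    new inputs were generated.
    """
    return max(0, merge_edge_coverage - initial_edge_coverage)


# With -max_len truncating the initial corpus read, the same corpus can
# yield e.g. initial_edge_coverage=900 but merge_edge_coverage=1000,
# reporting 100 spurious "new" edges.
print(compute_new_edges(900, 1000))  # → 100
```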
I don't have a good solution in mind. We could skip the `new_edge` metric for the random max length strategy, but that would mean losing a signal for the strategy.
My best idea so far is to make libFuzzer's merge give us all the numbers (i.e. initial coverage + new coverage), but that would complicate libFuzzer's interface, so I don't really like this either.
UPD: we don't use value profiling during merge, which means we might be missing some features.
UPD 2: One more thing that came up during a discussion with @jonathanmetzman today. With this new merge process, we might want to do `gsutil cp` and `gsutil rm` after every fuzzing run, to essentially "prune" the corpus. That way, the recurring corpus pruning task would have much less of a load and would be more like a clean-up / stats collection task, rather than a real heavy corpus pruning job.
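The per-run "prune" idea above could be sketched like this (illustrative only; function name and directory layout are assumptions, and `gsutil rsync -d` is used here as a one-command equivalent of the cp + rm combination):

```python
import subprocess


def sync_pruned_corpus(local_corpus_dir, gcs_corpus_url, dry_run=True):
    """Sketch: after libFuzzer's merge has left only the minimal corpus in
    local_corpus_dir, mirror it to GCS. The -d flag deletes remote files
    that no longer exist locally, i.e. the "rm" half of the idea above.
    """
    cmd = ['gsutil', '-m', 'rsync', '-d', local_corpus_dir, gcs_corpus_url]
    if dry_run:
        return cmd  # let callers inspect/log the command without running it
    subprocess.check_call(cmd)
    return cmd
```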
Action Items
- [x] 1. Modify libFuzzer's merge process so that it could provide the accurate stats. Done in https://reviews.llvm.org/D66107.
- [x] 2. Switch ClusterFuzz's launcher code to use the new merge process.
- [x] 3. Monitor the new behavior, debug and fix any failures.
- [ ] 4. Revisit old code for parsing merge stats (and other stats overlapping with these); consider removing some of it (e.g. the `merge_edge_coverage` column) and the sha1 check for corpus filenames.
- [ ] 5. Rewrite `_is_multistep_merge_supported` using a less hacky approach (it currently checks the help message).
- [ ] 6. Update Android test binaries: #1732.
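For reference, the "check help message" hack mentioned in item 5 amounts to something like the sketch below. The exact flag searched for (`merge_control_file` here) is an assumption for illustration; the point is that new-enough libFuzzer binaries advertise a flag that older ones don't:

```python
import subprocess


def is_multistep_merge_supported(target_path):
    """Sketch of the hacky detection: run the target with -help=1 and look
    for a flag only newer libFuzzer versions print in their help output."""
    try:
        result = subprocess.run(
            [target_path, '-help=1'],
            capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return 'merge_control_file' in result.stdout
```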
@mbarbella-chromium @mukundv-chrome for the short term, should we exclude random max length strategy? What'll be the best option for your ongoing experiment(s)?
Another bad idea is to do an extra run with `-runs=0` on the full corpus to get the real `initial_edge_coverage`. We could do it only for the runs where random max length is used, but that would be quite expensive and unwise anyway :(
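That extra run would look something like the sketch below (a hypothetical helper, not ClusterFuzz code): drop `-max_len` so the corpus isn't truncated, and pass `-runs=0` so libFuzzer only reads the corpus and reports coverage without fuzzing.

```python
def build_coverage_only_command(fuzzer_path, corpus_dirs, fuzz_args=()):
    """Sketch of the (too expensive) extra run described above."""
    # Drop -max_len so inputs are read in full; keep the other flags.
    kept = [a for a in fuzz_args if not a.startswith('-max_len=')]
    # -runs=0 makes libFuzzer load the corpus and exit without mutating.
    return [fuzzer_path, '-runs=0', *kept, *corpus_dirs]
```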
Discussed offline. The experimental data should be good for now, so I'll upload a CL to record `new_edges = 0` when the random max length strategy is used.
@inferno-chromium suggested a nice solution on the libFuzzer side (to print stats after reading the first corpus dir). I've uploaded a CL as https://reviews.llvm.org/D66020, but we probably need to choose some other keyword. INITED is supposed to indicate that ALL input files are read, so in our case it's kinda PARTIALLY INITED :)
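On the ClusterFuzz side, consuming such a line could be a simple log parse. A sketch, assuming a hypothetical `LOADED` keyword (the actual keyword is still being discussed in the review above) and the usual libFuzzer `cov:`/`ft:` stat format:

```python
import re

# Hypothetical line: "#4096  LOADED cov: 1500 ft: 4200 corp: 128/1Mb ..."
LOADED_LINE = re.compile(
    r'#\d+\s+LOADED\s+cov:\s*(?P<cov>\d+)\s+ft:\s*(?P<ft>\d+)')


def parse_initial_stats(log_text):
    """Extract (edge coverage, features) printed after the first corpus
    dir is read, or None if no such line is present."""
    match = LOADED_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group('cov')), int(match.group('ft'))
```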
One more relevant problem: we don't use `-use_value_profile=1` during merge, which means we discard some features that could have been discovered during fuzzing. That explains why the `new_features` stat so often has non-zero values.
I'll fix it as well when making the other merge changes.
FTR, the most recent Clang roll in Chromium should include my `-merge=1` change, so that is no longer a blocker, but there are still some other things that can delay switching to the new merge logic.
I've checked BigQuery stats for the past 15 days for both Chromium ClusterFuzz and OSS-Fuzz.
The change has definitely had an effect -- I can share stats with Googlers.
One noticeable and expected effect is that the average `new_edges` value has dropped.
Some other movements are also evident, though they are hard to comment on, as the stats were previously incomplete because we skipped recording them for certain strategies.
As per Abhishek's suggestion, I've checked whether this new implementation affected the number of post-fuzzing merge timeouts. Looking at the logs for the past 7+ days, the overall number of timeouts is not affected, which is expected, as the overall complexity of performing a merge hasn't changed much.
```
# Chromium ClusterFuzz, number of "Merging new testcases timed out" errors a day.
2019-12-10 : 1423
2019-12-11 : 1246
2019-12-12 : 1435 # two step merge was deployed at the end of this day
2019-12-13 : 1497
2019-12-14 : 1330
2019-12-15 : 1363
2019-12-16 : 1446
2019-12-17 : 1128 # not a full day
```
Re-evaluating some strategies / metrics after the change has been deployed for ~2 weeks (access to the links is restricted for non-Googlers):
- DFT on OSS-Fuzz: https://docs.google.com/spreadsheets/d/14hAEuRsfofSiInUTR7KA48SbGY4QPuxHOzObjVPInQk/edit#gid=1395928105
on average, DFT strategy is still losing to the runs where it's not being used
- ML RNN on ClusterFuzz: https://docs.google.com/spreadsheets/d/13te_2ZcRQhbm25B_h2Xg8-TFV0TBn4TSSBmbTNLy2_E/edit#gid=0
ML RNN looks pretty good; it seems to be useful, as 66.7% of fuzz targets perform better with ML RNN, while only 32% of targets perform better without it (all on average)
- MAB probabilities: https://datastudio.google.com/c/u/0/reporting/1crV7jDmoFh25j2ZMATEMa3X0HfsKqGOQ/page/QAox
The graphs seem to be broken, possibly due to https://github.com/google/clusterfuzz/issues/1304
I ran some queries myself in https://docs.google.com/spreadsheets/d/1dNgK9xg9zkdds6ckFalNWniVI2FtVZylkfcFKR52kX0/edit#gid=0
It might be a little hard to read, but an awesome observation here is how the fork strategy has been showing up among half of the top strategies since Dec 16th. Before that, the fork strategy was never in the top 10, simply because we did not have good stats for it. The ML RNN probability has also jumped by ~10-20%, but that might be a temporary fluctuation.
Once the Data Studio charts are fixed, we should be able to observe even more changes.
> One more relevant problem: we don't use `-use_value_profile=1` during merge, that means we discard some features that could've been discovered during fuzzing. That explains why the `new_features` stat so often has non-zero values. I'll fix it as well when making the other merge changes.

Has this been fixed?
I think so: https://github.com/google/clusterfuzz/pull/1915