
Fuzzer report

wideglide opened this issue · 6 comments

See commit notes.

Two examples attached:

wideglide · Feb 16 '21 09:02

Thanks a lot!

It looks like these are two relatively independent changes: i.e., (1) the new experiment summary table, and (2) the new fuzzer-oriented template. Would it make sense to have these in two separate PRs and get them reviewed separately?

lszekeres · Feb 17 '21 04:02

@lszekeres The changes are not really independent, because I use the new summary table in the fuzzer template report. However, switching to the new summary table in the default report can easily be overridden. I didn't actually remove the old summary table; I just added a new percent_summary_table method in experiment_results.py. The change to the default report template is very minor: it only calls the new summary table method.
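For a rough sense of what a percent-based summary could compute, here is a minimal sketch (illustrative only, not the actual percent_summary_table in experiment_results.py; the DataFrame column names and the median aggregation are assumptions): each fuzzer's median final coverage on a benchmark is expressed as a percentage of the best median on that benchmark.

```python
import pandas as pd

def percent_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a percent-of-max summary table (assumed columns:
    benchmark, fuzzer, trial_id, edges_covered)."""
    # Median final coverage per fuzzer on each benchmark.
    medians = (df.groupby(['benchmark', 'fuzzer'])['edges_covered']
                 .median()
                 .unstack('fuzzer'))
    # Each fuzzer's median as a percent of the benchmark's best median.
    percents = medians.div(medians.max(axis=1), axis=0) * 100
    # A single average row summarizes each fuzzer across benchmarks.
    percents.loc['average'] = percents.mean()
    return percents.round(1)
```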

Also, this partially addresses #1086, so that reports can be generated from older data.csv.gz files where crash_key is not present.
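For illustration, backward compatibility with older data files could come down to a small guard like the following (a hedged sketch, not necessarily the exact check in this PR; the file path and fill value are assumptions):

```python
import pandas as pd

df = pd.read_csv('data.csv.gz')  # illustrative path to older experiment data
# Older data.csv.gz files predate the crash_key column; add an empty one
# so crash statistics degrade gracefully instead of raising a KeyError.
if 'crash_key' not in df.columns:
    df['crash_key'] = pd.NA
```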

wideglide · Feb 17 '21 06:02

In reviewing results from previous experiments using the default report, I find it challenging to get a holistic view of how a fuzzer performs across benchmarks. I found myself scrolling between the benchmarks trying to remember relative performance on each one. I think the new summary table helps address that weakness significantly, but a report oriented around fuzzers (instead of benchmarks) gives a snapshot of each fuzzer across all benchmarks in pictorial and textual form, making it easier to identify where a fuzzer is strong and where it might be weak.

Of the new charts, I find the "Trial Ranks" chart to be the most useful, but maybe not quite intuitive enough. I might need to add some additional documentation/text to aid comprehension.

  • The "Rank" text shows the ranking of the fuzzer on each benchmark with the average rank listed at the top. This allows me to see the correlation of why each fuzzer achieved the overall rank that it did.
  • "Rank" itself is weak measure of central tendency, so the box plots of trial ranking distribution shows a visual depiction of whether the rank achieved is significant.
    • Looking at the chart below, AFLplusplus ranked 1st on [curl, harfbuzz, libxml2, and freetype2], and those rankings are significant because all of the individual trial rankings are above the 90th percentile.
    • For benchmarks like re2, you can see that AFLplusplus still ranked 1st, but that ranking is not significant because the trial ranks are very spread out.
    • Likewise, on libfuzzer's chart, you can tell that it did very poorly on at least four benchmarks, as those individual trial ranks are below the 10th percentile.
  • The percent-of-max and fuzzer-consistency charts are less helpful, as they tend to say more about the characteristics of each benchmark than about the fuzzer; but if you were interested in how one fuzzer compares to the best performance in the experiment across all benchmarks, the percent-of-max chart shows that.
  • The code coverage graph is probably my least favorite, and I'm not sure it is worth keeping, but I saw the interest (#1078) in a comprehensive coverage graph and attempted to generate one. I think if you added all the coverages together, the resulting graph would be dominated by one or two benchmarks, and you wouldn't really know which ones were a factor.
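To make the Trial Ranks idea concrete, here is a minimal sketch of how per-trial percentile ranks and their per-benchmark box plots could be produced (assumptions: a pandas DataFrame df with one row per trial and columns benchmark, fuzzer, and edges_covered; this is not the PR's actual plotting code):

```python
import seaborn as sns
import matplotlib.pyplot as plt

def trial_rank_plot(df, fuzzer):
    """Box-plot one fuzzer's per-trial percentile ranks across benchmarks."""
    df = df.copy()
    # Percentile-rank every trial against all trials on the same benchmark,
    # so 100 corresponds to the single best trial on that benchmark.
    df['trial_rank'] = df.groupby('benchmark')['edges_covered'].rank(pct=True) * 100
    ax = sns.boxplot(data=df[df['fuzzer'] == fuzzer], x='trial_rank', y='benchmark')
    ax.set(xlim=(0, 100), xlabel='trial percentile rank', title=f'Trial Ranks: {fuzzer}')
    plt.tight_layout()
    return ax
```

Under this reading, a fuzzer whose boxes sit mostly above the 90th percentile ranked consistently near the top, matching the AFLplusplus example above.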

[image: example per-fuzzer report charts]

wideglide · Feb 18 '21 21:02

One more note: because the report is organized by fuzzer in ascending rank order, you are likely to want to compare one fuzzer to the one directly above or below it, which should mean less scrolling to find the data you are looking for when comparing charts.

wideglide · Feb 18 '21 21:02

Thanks for the detailed explanation, and sorry for the slow response!

Of the new charts, I find the "Trial Ranks" chart to be the most useful, but maybe not quite intuitive enough. I might need to add some additional documentation/text to aid comprehension.

Yes, this plot can be useful, I agree. Adding a detailed description of how to read it under the plot would help a lot.

  • The percent-of-max and fuzzer-consistency charts are less helpful, as they tend to say more about the characteristics of each benchmark than about the fuzzer; but if you were interested in how one fuzzer compares to the best performance in the experiment across all benchmarks, the percent-of-max chart shows that.

Right, I'm worried these might create more confusion than clarity. They also seem a bit redundant with the first plot (trial ranks). I'd drop these for now.

  • The code coverage graph is probably my least favorite, and I'm not sure it is worth keeping, but I saw the interest (#1078) in a comprehensive coverage graph and attempted to generate one. I think if you added all the coverages together, the resulting graph would be dominated by one or two benchmarks, and you wouldn't really know which ones were a factor.

Yes, I find this very confusing as it compares apples to oranges (benchmarks). I'd drop this for now too.

Consider starting with the Trial Ranking Distribution chart only in this PR, and explaining in detail how to read it in the report.

lszekeres · Feb 28 '21 04:02

I went ahead and split this into two separate PRs as you originally suggested. I'll take another look at how to make this report more intuitive and useful.

wideglide · Mar 01 '21 18:03