fuzzbench
Experiment Announcement Thread
New experiments will be announced on this issue as discussed on #205. I think this is a good place to discuss experiments as well.
https://www.fuzzbench.com/reports/2020-04-20-aflplusplus/index.html contains afl++ variants. The experiment is currently running; most of the trials have completed, though some started running earlier today, and coverage measurement still needs to finish.
I just started another experiment https://www.fuzzbench.com/reports/2020-04-21/index.html that includes every fuzzer except for the afl++ variants benchmarked above.
Once both of these experiments complete I will make a combined report.
Note that coverage measuring for the previous experiment https://www.fuzzbench.com/reports/2020-04-14/index.html didn't complete because of a bug I have fixed.
@jonathanmetzman when do you expect measurement for the 2020-04-20-aflplusplus experiment to finish? And will the title then be updated so it no longer says "incomplete"?
No need to combine this with 2020-04-21, btw; that would just make the overall report graphs unreadable with so many entries :)
Title gets updated automatically. Hard to say when it will complete; I expect a day or two from now. The recent fixes we made and the removal of irssi should reduce measurement time considerably. Unfortunately, the afl++ experiment started before that.
OK sounds good to me. Let me know if you want just the AFL++ fuzzers then (I can do that as well).
That would be good. However, the new run does not have the irssi target in it ... does that work without messing up the benchmark calculation?
I can exclude that benchmark.
Both of these experiments have finished measuring. I think we are going to prioritize speeding up measuring.
Made this combination report: https://www.fuzzbench.com/reports/2020-04-21-and-20-aflplusplus/index.html. It contains all benchmarks (except woff and irssi) and all afl++-based fuzzers from 2020-04-21 and 2020-04-20-aflplusplus.
@jonathanmetzman thanks! Start the next batch whenever you can - that one will be very interesting, especially the fuzzers with increased map sizes :)
Not exactly sure when this will be, but probably by Friday; I'm trying to work on some long-term improvements and planning this week.
I started experiments for AFL++ and fastcgs with and without huge page tables.
Experiments are done. https://www.fuzzbench.com/reports/2020-05-01-fastcgs/index.html compared versions of fastcgs with support for huge page tables and without. CC @alifahmed
https://www.fuzzbench.com/reports/2020-05-01-aflplusplus-1/index.html and https://www.fuzzbench.com/reports/2020-05-01-aflplusplus-2/index.html compared afl++ variants.
I'm starting to work on using resources more intelligently in experiments (particularly for fuzzer variants or features that are in development). The next experiments I will test using preemptible instances for trials. The differences you will see here are:
- Experiments will be 23 hours (since preemptible instances cannot last longer than 24 hours and we need time to start up).
- Some trials (generally 5-15%, according to the docs) will not complete. Even if 3 trials do not complete for a fuzzer-benchmark pair, it only means comparisons between it and other fuzzers will be slightly less significant (see the rough estimate below). We may be able to fix this in the future by restarting preempted trials, but for now we won't.
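As a rough illustration of that last point (this is not FuzzBench code, and the 20-trials-per-pair figure is just an assumed example), here is a quick binomial estimate of how many trials survive preemption:

```python
# Back-of-the-envelope sketch (not FuzzBench code): if each preemptible trial
# is preempted independently with probability p, how many of the trials for
# one fuzzer-benchmark pair complete? The 20-trial count is hypothetical.
from math import comb

def completion_distribution(trials, preempt_rate):
    """P(exactly k trials complete) for k = 0..trials (binomial model)."""
    q = 1.0 - preempt_rate
    return [comb(trials, k) * q**k * preempt_rate**(trials - k)
            for k in range(trials + 1)]

TRIALS = 20
for p in (0.05, 0.15):  # the rough preemption range quoted by the docs
    dist = completion_distribution(TRIALS, p)
    expected = TRIALS * (1.0 - p)
    lose_more_than_3 = sum(dist[:TRIALS - 3])  # i.e. completed <= TRIALS - 4
    print(f"preemption rate {p:.0%}: expect {expected:.1f}/{TRIALS} trials, "
          f"P(lose more than 3) = {lose_more_than_3:.2f}")
```

Under these assumptions, at a 5% preemption rate losing more than 3 of 20 trials for a given pair is unlikely (roughly 2%), while at 15% it happens about a third of the time, which is why restarting preempted trials may eventually be worth doing.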
Running a new experiment with the main fuzzers and the new benchmark. The experiment is using preemptible VMs: https://www.fuzzbench.com/reports/2020-05-04-preempt-new-bench/index.html
EDIT:
I'll discuss the results from that experiment here to reduce spam on this thread.
https://www.fuzzbench.com/reports/2020-05-11/index.html
A 15-trial full experiment: https://www.fuzzbench.com/reports/2020-05-24/index.html
A 20-trial experiment comparing aflplusplus_optimal, aflplusplus_shmem, and ankou (with a buggy integration): https://www.fuzzbench.com/reports/2020-05-28/index.html
https://www.fuzzbench.com/reports/2020-06-12/index.html is an experiment with libfuzzer_nocmp, aflcc, manul and afl++ (and its variants) combined with 2020-05-24.
The results of the new experiment https://www.fuzzbench.com/reports/2020-07-17/index.html currently look very different from previous runs. What has changed? Is it using the new coverage, or have the benchmark targets changed?
Was there already an assessment of which of the two coverage methods is better, or what the advantages/disadvantages are? Thanks!
What did you notice that's different in 2020-07-17? I stopped that experiment to run 2020-07-13 (the AFL++ experiment I was supposed to run for you a few days ago but accidentally didn't, because of a bug in the service code; 2020-07-13 is running right now).
We're only partway through 2020-07-17, but I ran it to make sure that the results are the same as usual. 2020-07-17 was run using https://github.com/google/fuzzbench/pull/509, which will allow non-FuzzBench maintainers to easily contribute benchmarks from OSS-Fuzz. In theory the results should be the same even though the builds are different. But any differences are very helpful for me.
I moved 2020-07-17 here since I don't consider it an official experiment.
Note that 2020-07-17 used sancov, the current coverage implementation; only the clang-cov-test experiment used clang cov. So far it looks totally fine to replace sancov with clang cov, but we're still investigating.
honggfuzz is not in the top list and instead fastcgs_lm is ... both unusual. Also, aflplusplus_ctx_nozerosingle has no reason to be in 3rd place and should rather be around ankou. Sure, it has only been running for 1/3 of the time, but usually there are no dramatic improvements or degradations over the whole benchmark after 6 hours.
Also, 2020-07-17 is not updated anymore since it was moved.
I stopped that experiment to run yours.
There is a very noticeable bump in edge coverage between all the reports before 07/25 and after 08/03.
https://www.fuzzbench.com/reports/2020-07-25/index.html https://www.fuzzbench.com/reports/2020-08-03/index.html
I am wondering what caused it?
True! And systemd has much less now, which is weird.
We moved to using clang code coverage instead of sancov.
what is measured now? edges? basic blocks? lines of codes? instructions?
It is called regions in clang code coverage: https://llvm.org/docs/CoverageMappingFormat.html#id14. It has character-level precision, e.g. there can be multiple regions in just one line like `return x || y && z`.
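For anyone who wants to see what a "region" looks like concretely, here is a minimal sketch (not FuzzBench's measurement code) that compiles a tiny C function with clang's source-based coverage and prints per-region counts. It assumes clang, llvm-profdata, and llvm-cov are on PATH; the file names and the one-line function are made up for illustration:

```python
"""Minimal sketch: show clang source-based coverage "regions" on one line.

Not FuzzBench code; assumes clang, llvm-profdata and llvm-cov are on PATH.
"""
import os
import subprocess
import tempfile

# The single return line below contains several regions because of the
# short-circuiting || and && operators.
C_SOURCE = """
int f(int x, int y, int z) { return x || (y && z); }
int main(void) { return f(1, 0, 0) ? 0 : 1; }
"""

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "regions.c")
    exe = os.path.join(tmp, "regions")
    raw = os.path.join(tmp, "regions.profraw")
    prof = os.path.join(tmp, "regions.profdata")
    with open(src, "w") as f:
        f.write(C_SOURCE)

    # Instrument with clang's source-based (region) coverage.
    subprocess.run(["clang", "-fprofile-instr-generate", "-fcoverage-mapping",
                    src, "-o", exe], check=True)
    # Run once; the raw profile is written to LLVM_PROFILE_FILE.
    subprocess.run([exe], env={**os.environ, "LLVM_PROFILE_FILE": raw},
                   check=True)
    # Index the raw profile, then print execution counts per region.
    subprocess.run(["llvm-profdata", "merge", "-sparse", raw, "-o", prof],
                   check=True)
    subprocess.run(["llvm-cov", "show", exe, f"-instr-profile={prof}",
                    "-show-regions"], check=True)
```

Running this prints an execution count for each region; the one-line body of `f` is split into several regions because `||` and `&&` short-circuit, which is the character-level precision described above.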
Thanks! I think the report graphs should indicate this change to avoid confusion. Currently the graphs still say "Reached edge coverage". Shouldn't it be something like region/code coverage?
I think we discussed it a little internally; some people felt that just "regions" can be confusing and the community is more accustomed to edges for coverage. @lszekeres - can you revisit and fix this confusion sometime next week?