Effects of the collision problem in AFL (and variants thereof) on the coverage evaluations
Hi everybody,
I wonder to what extent the branch collision problem in AFL (and most of its variants) affects the code coverage evaluations of FuzzBench. If I understand correctly, the branch collision problem causes AFL to reject coverage-increasing inputs, i.e. they are not added to the input queue. This also means that FuzzBench cannot capture these inputs, which can lead to inaccurate coverage results and thus biased evaluations, especially when comparing AFL-based fuzzers to non-AFL-based fuzzers.
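For context, here is a minimal sketch of AFL-style edge hashing, simplified from the AFL technical whitepaper. The block IDs are toy values chosen to force a collision (in real AFL they are random compile-time constants), and none of this is taken from the FuzzBench code:

```c
/* Minimal sketch of AFL-style edge coverage, simplified from the AFL
 * technical whitepaper. The block IDs below are toy values chosen to
 * force a collision; real AFL uses random compile-time constants. */
#include <stdint.h>
#include <stdio.h>

#define MAP_SIZE 65536                 /* AFL's default coverage map size */

static uint8_t  coverage_map[MAP_SIZE];
static uint32_t prev_location;

/* Instrumentation emitted at every basic block: the index for the edge
 * prev -> cur is (cur ^ (prev >> 1)), so two distinct edges can hash to
 * the same byte of the map, which is the collision problem. */
static void record_block(uint32_t cur_location) {
  coverage_map[(cur_location ^ prev_location) % MAP_SIZE]++;
  prev_location = cur_location >> 1;
}

int main(void) {
  /* Edge A: block 2 -> block 4  => index 4 ^ (2 >> 1) = 5 */
  prev_location = 0; record_block(2); record_block(4);

  /* Edge B: block 4 -> block 7  => index 7 ^ (4 >> 1) = 5, same slot. */
  prev_location = 0; record_block(4); record_block(7);

  /* To AFL, an input that only exercises edge B looks like a re-hit of
   * edge A, so it may never be added to the queue. */
  printf("map[5] = %u\n", coverage_map[5]);   /* prints 2 */
  return 0;
}
```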
In the paper "CollAFL: Path Sensitive Fuzzing", the authors say that for certain programs, up to 75% of the edges could collide with others, which in my eyes would lead to a large bias in your evaluations.
Do you have any numbers on how big the deviations are and thus whether this is negligible?
Best regards, Stephan
I wonder to what extent the branch collision problem in AFL (and most of its variants) affects the code coverage evaluations of FuzzBench
It's a good question. I don't know the extent. It's worth pointing out though that one of the top fuzzers is AFL++ (which I believe still has collisions, maybe @vanhauser-thc knows if they have been mitigated).
If I understand correctly, the branch collision problem causes AFL to reject coverage-increasing inputs, i.e. they are not added to the input queue
This is also my understanding.
In the paper "CollAFL: Path Sensitive Fuzzing", the authors say that for certain programs, up to 75% of the edges could collide with others, which in my eyes would lead to a large bias in your evaluations.
I haven't read this paper, but I don't have reason to doubt it. Note, though, that 75% of edges having collisions doesn't directly illustrate the size of this problem. For example, we don't know how many of those edges are dead. It's also worth pointing out that AFL considers hit counts (using bucketing), so it's possible for testcases that cover new edges colliding with previously seen ones to still be added to the corpus (I'm not sure whether these two points are addressed in the paper).
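To make the hit-count point concrete, here is a rough model of AFL's bucketing. The bucket thresholds follow count_class_lookup8 in afl-fuzz, but the seen_buckets bookkeeping is a simplified stand-in for AFL's actual virgin-map logic, for illustration only:

```c
/* Rough model of AFL's hit-count bucketing; a simplified illustration,
 * not AFL's actual new-coverage check. */
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE 65536

static uint8_t seen_buckets[MAP_SIZE];   /* bitmask of buckets seen per slot */

/* Collapse a raw hit count into one of AFL's buckets. */
static uint8_t bucket(uint8_t hits) {
  if (hits == 0)   return 0;
  if (hits == 1)   return 1;
  if (hits == 2)   return 2;
  if (hits == 3)   return 4;
  if (hits <= 7)   return 8;
  if (hits <= 15)  return 16;
  if (hits <= 31)  return 32;
  if (hits <= 127) return 64;
  return 128;
}

/* An input is kept if any map slot reaches a bucket not seen before. So a
 * new edge that collides with a known one can still get the input saved,
 * provided the extra hits push that slot into a new bucket. */
static int is_interesting(const uint8_t *trace) {
  int new_bits = 0;
  for (size_t i = 0; i < MAP_SIZE; i++) {
    uint8_t b = bucket(trace[i]);
    if (b & ~seen_buckets[i]) {
      seen_buckets[i] |= b;
      new_bits = 1;
    }
  }
  return new_bits;
}
```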
Do you have any numbers on how big the deviations are and thus whether this is negligible?
I don't have the numbers on this. I don't really know if it's negligible, and I appreciate you bringing this to my attention. I'm not sure there is a better way to measure coverage, though. It's also worth considering whether AFL should be penalized for this heuristic, since it hurts fuzzing in many real-world scenarios (such as sharing a corpus between parallel instances of a fuzzer, or fuzzing and then restarting fuzzing).
Finally, I'm not convinced that this is leading to "wrong" results.
I don't have a single report combining data from our crash-based experiments, but their results are pretty similar to our coverage-based ones. If you compare them (example) to our coverage-based experiment, you will see that the AFL-based fuzzers still do worse than the non-AFL-based fuzzers (with the exception of Entropic; it's not clear what's going on there). So I don't think it's likely that this bias against AFL is leading FuzzBench to penalize the rankings of AFL-based fuzzers in a significant way.
AFL++'s default instrumentation does not have branch collisions; it is collision-free. This is achieved by a dynamic map size. AFL++'s setup on FuzzBench of course uses collision-free coverage. libFuzzer, honggfuzz and AFL++ all do this; as far as I know, none of the other fuzzers do.
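For readers unfamiliar with the approach, here is a sketch of the idea behind collision-free coverage, using Clang's documented SanitizerCoverage trace-pc-guard callbacks (the target is built with -fsanitize-coverage=trace-pc-guard). This only illustrates the principle of giving every edge its own counter slot; it is not AFL++'s actual PCGUARD implementation:

```c
/* Sketch of collision-free edge coverage using SanitizerCoverage's
 * trace-pc-guard callbacks. Illustration of the principle only, not
 * AFL++'s real instrumentation. */
#include <stdint.h>
#include <stdlib.h>

static uint8_t *edge_counters;   /* one counter per instrumented edge */
static uint32_t num_edges;       /* total number of guards assigned so far */

/* Called by the runtime for each instrumented module; gives every guard
 * (edge) a unique, consecutive ID starting at 1, so the map size is
 * determined by the number of edges rather than a fixed hash range. */
void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
  if (start == stop || *start) return;        /* initialize only once */
  for (uint32_t *g = start; g < stop; g++) *g = ++num_edges;
  /* Single-module sketch; a real tool would grow this across modules. */
  free(edge_counters);
  edge_counters = calloc(num_edges + 1, 1);
}

/* Called on every edge execution; *guard is that edge's unique index, so
 * no two edges ever share a counter and nothing is hashed. */
void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
  if (edge_counters) edge_counters[*guard]++;
}
```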
Hi @jonathanmetzman and @vanhauser-thc,
Thanks for your responses. This is something that came to my mind recently, and I was wondering whether you have any numbers showing what impact, if any, the collision problem has on your coverage sampling approach.
I'm also running larger fuzzing evaluations with a custom LLVM-based coverage profiler on my end, and I will probably take a closer look at this problem in the near future. I'll keep you posted on the results.