
Variance in FuzzBench trials with same random seed

Open jonathanmetzman opened this issue 5 years ago • 31 comments

@vanhauser-thc wrote here:

the experiment has started => https://www.fuzzbench.com/reports/2020-08-12/index.html and we can see the aflplusplus_same1 .. aflplusplus_same3 instances to have quite a different coverage.

for several benchmarks this will even out as their coverage saturates after 6 hours, so capturing this now is important.

easiest to see for curl, re2 and sqlite3:

benchmark   fuzzer1   fuzzer2   fuzzer3
curl        17148.0   17161.5   17155.0
re2          3461.0    3455.0    3488.0
sqlite3     26895.0   26611.5   26420.5

as all PRNG values are the same and input selection is based on the same values, this shows that the instances are running at different speeds.

jonathanmetzman avatar Aug 12 '20 17:08 jonathanmetzman

I ran the three examples you mention on my local machine. Here is curl: [plot: the two curl runs diverge over time]. On the left we have re2 and on the right we have sqlite: [screenshot from 2020-08-12].

Though the times diverge, at the exact same time the two different runs don't have the same number of paths.

I did everything I could think of to make runs comparable, and I used this command for each run AFL_SKIP_CPUFREQ=1 ./afl-fuzz -d -s 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647 (fuzz-target was changed for different benchmarks).

I don't really have the time to investigate this so thoroughly right now, but I plan on doing more when I have free time since it is an interesting question. But it seems like at least some of the variance isn't really due to fuzzbench itself.

One way we plan to address this is by recording things like CPU/RAM usage and allowing fuzzers to report their own custom stats.

jonathanmetzman avatar Aug 12 '20 17:08 jonathanmetzman

Can you explain how the "random seed" is causing variance and how this is a problem? Does it cause variance in the inputs to be fuzzed?

I should note that there are many sources of non-reproducibility, and they all come down to "features or behavior outside the application's control". All the more so if third-party software is being tested for crashes. For example the following things can get in the way of reproducibility:

  • Floating-point numbers, due to differences in accuracy, rounding, and operation order. This includes parallel reductions (a toy example follows this list).
  • Multithreading and dynamic task scheduling.
  • Nondeterministic sources (such as the file system or the system clock).
  • Undocumented, undefined, or implementation-dependent behavior or features.
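As a toy illustration of the first bullet, the same floating-point additions performed in a different order give a different result, which is exactly what a parallel reduction can do when its reduction order changes (plain Python, not tied to any fuzz target):

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False: the result depends on operation order
```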

peteroupc avatar Aug 12 '20 17:08 peteroupc

Can you explain how the "random seed" is causing variance and how this is a problem? Does it cause variance in the inputs to be fuzzed?

I don't think it is causing varying behavior. I think the problem is that setting the same random seed should ideally do the opposite. Ideally, if I run afl-fuzz with the same inputs and the same random seed, it should produce the same results every time.

I should note that there are many sources of non-reproducibility, and they all come down to "features or behavior outside the application's control". All the more so if third-party software is being tested for crashes. For example the following things can get in the way of reproducibility:

  • Floating-point numbers, due to differences in accuracy, rounding, and operation order. This includes parallel reductions.
  • Multithreading and dynamic task scheduling.
  • Nondeterministic sources (such as the file system or the system clock).
  • Undocumented, undefined, or implementation-dependent behavior or features.

Yes I suspect some of these are the sources for the variance we are seeing.

jonathanmetzman avatar Aug 12 '20 17:08 jonathanmetzman

To compare you should not look at the time but at the number of executions. As there is always randomness in scheduling the processes, that is what must be matched.

If there is a difference ... maybe there is still something I haven’t thought about.

But I would rather put my money on the not 100% stability. This will very likely lead to a different seed ranking and from there the diversity starts and grows.

vanhauser-thc avatar Aug 12 '20 20:08 vanhauser-thc

Sorry to chime in here, but do you have to provide a value for the -s param, or will it fall back to a predefined fixed seed?

  • https://github.com/AFLplusplus/AFLplusplus/blob/2.66c/src/afl-fuzz.c#L124
  • https://github.com/AFLplusplus/AFLplusplus/blob/2.66c/src/afl-fuzz.c#L285-L289
  • https://github.com/AFLplusplus/AFLplusplus/blob/2.66c/src/afl-performance.c#L36-L45

jaylinski avatar Aug 12 '20 20:08 jaylinski

To compare you should not look at the time but at the number of executions. As there is always randomness in scheduling the processes, that is what must be matched.

Good point!

But I would rather put my money on the not 100% stability. This will very likely lead to a different seed ranking and from there the diversity starts and grows.

IMO this is the most likely culprit. We can report stability when https://github.com/google/fuzzbench/pull/648 lands and try another experiment with fixed seeds and setting persistent executions to 1. I suspect this would boost determinism quite a bit (but maybe not enough, since I assume if there is any non-determinism, seeds won't help).

jonathanmetzman avatar Aug 12 '20 20:08 jonathanmetzman

Sorry to chime in here, but do you have to provide a value for the -s param, or will it fall back to a predefined fixed seed?

  • https://github.com/AFLplusplus/AFLplusplus/blob/2.66c/src/afl-fuzz.c#L124
  • https://github.com/AFLplusplus/AFLplusplus/blob/2.66c/src/afl-fuzz.c#L285-L289
  • https://github.com/AFLplusplus/AFLplusplus/blob/2.66c/src/afl-performance.c#L36-L45

Sorry is this a question for me? I did provide a value, the command I used was AFL_SKIP_CPUFREQ=1 ./afl-fuzz -d -s 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647. I initially pasted AFL_SKIP_CPUFREQ=1 ./afl-fuzz -s -d 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647 into my second comment. But that was not the command I used (and AFL++ would refuse to run given that) so I corrected my comment. Apologies!

jonathanmetzman avatar Aug 12 '20 20:08 jonathanmetzman

Sorry is this a question for me? I did provide a value, the command I used was AFL_SKIP_CPUFREQ=1 ./afl-fuzz -d -s 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647. I initially pasted AFL_SKIP_CPUFREQ=1 ./afl-fuzz -s -d 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647 into my second comment. But that was not the command I used (and AFL++ would refuse to run given that) so I corrected my comment. Apologies!

Ah, I see. Question answered :+1:

jaylinski avatar Aug 12 '20 20:08 jaylinski

We continued some of this discussion in discord. To summarize what was learned there, @gamozolabs tested the hypothesis that it's non-determinism in fuzz targets that is causing AFL++ to behave non-deterministically. From some eyeball experiments, it seems like this is definitely not the case.

jonathanmetzman avatar Aug 12 '20 20:08 jonathanmetzman

using afl++ 2.66c or newer with llvm 9 or newer and afl-clang-fast/afl-clang-lto with the default instrumentation options produces deterministic results if the target has 100% stability. However: using -c cmplog makes it non-deterministic.

TLDR: aflplusplus_same* variants have to produce the same results if run for the same number of executions for those benchmarks that have 100% stability. These are (at least): libpng, lcms, vorbis, woff2 and not (at least): bloaty_fuzz_target,freetype2-2017,harfbuzz-1.3.2,libpcap_fuzz_both,openssl_x509,openthread-2019-12-23,re2-2014-12-09,sqlite3_ossfuzz

this is just 15 of the 21 benchmarks though
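For checking this locally, a minimal sketch that reads the stability value AFL-style fuzzers write into fuzzer_stats; the /tmp output layout below is a hypothetical example, so point it at wherever -o was given:

```python
# Print the "stability" line from each local afl-fuzz output directory.
from pathlib import Path

def read_stat(stats_file: Path, key: str) -> str:
    # fuzzer_stats is a simple "key : value" text file.
    for line in stats_file.read_text().splitlines():
        name, _, value = line.partition(":")
        if name.strip() == key:
            return value.strip()
    return "n/a"

for stats in sorted(Path("/tmp").glob("o*/fuzzer_stats")):
    print(stats.parent.name, read_stat(stats, "stability"))
```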

vanhauser-thc avatar Aug 12 '20 22:08 vanhauser-thc

Though the times diverge, at the exact same time,

all your three targets have below 100% stability. and this statement shows that it is very likely the cause: if it diverges at the exact same time then it means that a different seed was selected at this moment - the cause being the instability.

try with lcms, vorbis or woff2 instead, no cmplog

EDIT: which explains why these three you benchmarked were also the ones in the fuzzbench run with the highest differences among the three same variants. :)

vanhauser-thc avatar Aug 12 '20 22:08 vanhauser-thc

btw is there any other fuzzer that can be configured to run deterministically? because afl + variants can't, honggfuzz can't... maybe libfuzzer? not if they select seeds based on a rand() on a list that was generated by readdir() order or if the runtime is a factor for a seed. feels odd that this is not standard. how would you otherwise effectively test changes in a fuzzer?

vanhauser-thc avatar Aug 12 '20 22:08 vanhauser-thc

btw is there any other fuzzer that can be configured to run deterministically? because afl + variants can't, honggfuzz can't... maybe libfuzzer? not if they select seeds based on a rand() on a list that was generated by readdir() order or if the runtime is a factor for a seed. feels odd that this is not standard. how would you otherwise effectively test changes in a fuzzer?

libfuzzer supports -seed=N. I think it is not a standard because the original AFL doesn't support it and everyone copied this.

all your three targets have below 100% stability. and this statement shows that it is very likely the cause: if it diverges at the exact same time then it means that a different seed was selected at this moment - the cause being the instability.

try with lcms, vorbis or woff2 instead, no cmplog

Right, that makes sense.

Let's look at a report of the first 15 minutes of the experiment for the fixed seed fuzzers on the stable benchmarks (so time doesn't smooth out differences in behavior by reaching a plateau). http://commondatastorage.googleapis.com/fuzzbench-reports/stable-report/index.html

None of these fuzzers shows a statistically significant difference from another on any particular benchmark, or in aggregate.

While I would be more comfortable if every benchmark looked like libpng where every trial gets within 2 edges of one another, it seems like random environment differences aren't affecting results.

Even if we don't ignore the unstable benchmarks, in the first 15 minutes (http://commondatastorage.googleapis.com/fuzzbench-reports/unstable-report/index.html) the fuzzers in aggregate do pretty much the same: the "best" has a 99.91 avg normalized score, while the worst has 99.62. This is also true for the rest of the experiment (https://www.fuzzbench.com/reports/2020-08-12/index.html): the best score is 89.06 and the worst is 88.58. Neither of these differences is critical according to the diagrams, and I personally wouldn't change fuzzers because one of them finds 0.48% more edges on average across 21 benchmarks.

Does this still seem problematic?

jonathanmetzman avatar Aug 12 '20 23:08 jonathanmetzman

maybe libfuzzer? not if they select seeds based on a rand() on a list that was generated by readdir() order or if the runtime is a factor for a seed.

IIRC libfuzzer will sort the input corpus by size by default (when -prefer_small=1) before executing it. I don't think there is a rand() problem when using -seed. Anecdotally, libFuzzer with the same seed is generally reproducible for me (as long as state doesn't leak, of course).

a1xndr avatar Aug 12 '20 23:08 a1xndr

using afl++ 2.66c or newer with llvm 9 or newer and afl-clang-fast/afl-clang-lto with the default instrumentation options produces deterministic results if the target has 100% stability. However: using -c cmplog makes it non-deterministic.

TLDR: aflplusplus_same* variants have to produce the same results if run for the same number of executions for those benchmarks that have 100% stability. These are (at least): libpng, lcms, vorbis, woff2 and not (at least): bloaty_fuzz_target,freetype2-2017,harfbuzz-1.3.2,libpcap_fuzz_both,openssl_x509,openthread-2019-12-23,re2-2014-12-09,sqlite3_ossfuzz

Yes, using the -E option as you recommended on Discord allowed me to get results that looked deterministic with respect to a given seed. If the above is a problem, we might be able to support experiments that stop when the fuzzer decides to end them, so you could control for differences in execution speeds among trials. This wouldn't help with comparing different tools, but it would allow for rock-solid comparisons of different changes to the same tool.

jonathanmetzman avatar Aug 13 '20 00:08 jonathanmetzman

btw is there any other fuzzer that can be configured to run deterministically? because afl + variants can't, honggfuzz can't... maybe libfuzzer? not if they select seeds based on a rand() on a list that was generated by readdir() order or if the runtime is a factor for a seed. feels odd that this is not standard. how would you otherwise effectively test changes in a fuzzer?

libfuzzer supports -seed=N. I think it is not a standard because the original AFL doesn't support it and everyone copied this.

yes I know it does. though this is not the only point of entropy. for example afl* and honggfuzz select seeds based on their runtime, where shorter is better. and the runtime is not deterministic. and honggfuzz seems to read the corpus directory with readdir without sorting (and also selecting based on runtime).

I don't know the source for libfuzzer (because I have never searched for it and the llvm-project source code is gigantic ;) so I don't know if these things are also accounted for or not.

@a1xndr thanks for the info! -prefer_small=1 also seems to be the default. do you know if libfuzzer takes the runtime of a seed into account for seed selection?

all your three targets have below 100% stability. and this statement shows that it is very likely the cause: if it diverges at the exact same time then it means that a different seed was selected at this moment - the cause being the instability. try with lcms, vorbis or woff2 instead, no cmplog

Right, that makes sense.

I will generate the rest of the targets over the day so we have a complete list of what is 100% stable and what is not. can't hurt to document that somewhere in the project :)

None of these fuzzers shows a statistically significant difference from another on any particular benchmark, or in aggregate.

Does this still seem problematic?

way too early to call. the experiment is restarted today so let's see when the run is completed. it could be that after 15 minutes or 4 hours it is still within OK boundaries, however it could be that after 12 or 24 hours this is not the case anymore. however now we know that we have to concentrate on those targets that are 100% stable and disregard the rest. let's see. that's what benchmark tests are for :)

even if the difference in the end is not significant, it will show us what the variance is - we will have 60 totally deterministic runs of 4+ targets each. that will show us the boundary of variance that the fuzzbench runs have. pretty sure @gamozolabs will be able to make a beautiful graph out of that!

vanhauser-thc avatar Aug 13 '20 07:08 vanhauser-thc

The complete list of 100% stability targets: jsoncpp_jsoncpp_fuzzer, lcms-2017-03-21, libpng-1.2.56, mbedtls_fuzz_dtlsclient, vorbis-2017-12-11, woff2-2016-05-06, zlib_zlib_uncompress_fuzzer

That is 1/3 which is enough IMHO for a proper analysis

vanhauser-thc avatar Aug 13 '20 10:08 vanhauser-thc

yes I know it does. though this is not the only point of entropy. for example afl* and honggfuzz select seeds based on their runtime, where shorter is better. and the runtime is not deterministic.

I think this should be the only source for entropy though @morehouse might be able to confirm.

and honggfuzz seems to read the corpus directory with readdir without sorting (and also selecting based on runtime).

I don't know the source for libfuzzer (because I have never searched for it and the llvm-project source code is gigantic ;) so I don't know if these things are also accounted for or not.

The only code used to build libFuzzer is located in llvm-project/compiler-rt/lib/fuzzer

I will generate the rest of the targets over the day so we have a complete list of what is 100% stable and what is not. can't hurt to document that somewhere in the project :)

We document details on each benchmark here. You can add stability info by editing this file

jonathanmetzman avatar Aug 13 '20 16:08 jonathanmetzman

The complete list of 100% stability targets: jsoncpp_jsoncpp_fuzzer, lcms-2017-03-21, libpng-1.2.56, mbedtls_fuzz_dtlsclient, vorbis-2017-12-11, woff2-2016-05-06, zlib_zlib_uncompress_fuzzer

That is 1/3 which is enough IMHO for a proper analysis

Funny, I guessed that zlib and jsoncpp were stable when doing my analysis last night since the performance of the three "sames" on these was exactly equal like with libpng.

jonathanmetzman avatar Aug 13 '20 16:08 jonathanmetzman

It is now running again with same1-3: https://www.fuzzbench.com/reports/2020-08-14/index.html Let's see on Monday how they fared. (I won't be on, I guess; my grandmother has her 100th birthday that day)

vanhauser-thc avatar Aug 14 '20 19:08 vanhauser-thc

for the "real" analysis this would need to compare the execs_done value in the corpus/fuzzer_stats file for the six 100% stable benchmarks of all 20 runs of the three _same*

that would be 60 runs per benchmark, which will show how much CPU variance the fuzzbench setup introduces.

vanhauser-thc avatar Aug 16 '20 15:08 vanhauser-thc

@vanhauser-thc Thanks for running this experiment!

Just discussed this with Jonathan and Laszlo today, and I think the experiment strongly validates FuzzBench's statistical analysis.

The experiment was run on 23 benchmarks, with 3 trials each (same1-3), for a total of 69 experiments. On only two benchmarks (vorbis, proj4) does FuzzBench report a statistically significant difference at the 0.05 level. This is in line with what we would expect at the 0.05 level: 69 * 0.05 = 3.45 expected incorrect conclusions.

@lszekeres expressed some concern that the statistical analysis was potentially "too aggressive", since it concluded significant differences for median coverage values that were only 1% and 0.1% apart for proj4 and vorbis, respectively. After thinking about this more, I don't think this is the case. Since we provided the same seed for each trial, variance was likely very low, which would cause the statistical test to make stronger statements about small differences in medians. So we would still expect 5% of the statistical statements to come to the wrong conclusion. If we wanted fewer wrong conclusions, we would need to look at the 0.01 level or below.
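A small sketch of that effect, assuming a Mann-Whitney U pairwise test as used in the FuzzBench reports (the coverage numbers below are made up): with near-zero within-group variance, a ~0.1% gap in medians already produces a very small p-value.

```python
from scipy.stats import mannwhitneyu

# Two hypothetical sets of 20 trial results whose medians differ by only ~0.1%.
same_a = [10000 + (i % 3) for i in range(20)]   # values 10000..10002
same_b = [10010 + (i % 3) for i in range(20)]   # values 10010..10012

stat, p = mannwhitneyu(same_a, same_b, alternative="two-sided")
print(f"U={stat}, p={p:.2e}")  # p is far below 0.05 despite the tiny gap in medians
```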

morehouse avatar Aug 20 '20 19:08 morehouse

for the "real" analysis this would need to compare the execs_done value in the corpus/fuzzer_stats file for the six 100% stable benchmarks of all 20 runs of the three _same*

that would be 60 runs per benchmark, which will show how much CPU variance the fuzzbench setup introduces.

@morehouse --- ah no. what I quoted is what I wrote previously. the 3x20 execs_done of the 6 100% stable benchmarks have to be compared. Can you please supply that data? Or point to the archive where this can be extracted? Just looking at the stats in the report is not enough. that should be obvious.

vanhauser-thc avatar Aug 20 '20 21:08 vanhauser-thc

Rereading morehouse's comment, it seems to me that it is unclear why the report does not answer the question about variance, at least not without a lot of effort.

what is the impact of the varied CPU resources because of system activity, docker etc.? it shows up in the number of executions performed in the benchmark duration.

So how to measure that? exactly by this - checking the variance of the number of executions performed for the benchmarks and fuzzers which are 100% deterministic, and the only fuzzers and benchmarks for which this is true are aflplusplus_same1-3 and jsoncpp_jsoncpp_fuzzer, lcms-2017-03-21, libpng-1.2.56, mbedtls_fuzz_dtlsclient, vorbis-2017-12-11, woff2-2016-05-06, zlib_zlib_uncompress_fuzzer.

You cannot use coverage - unless you perform extensive graph analysis. as the new edges found correlate with executions performed, not time, you could only use coverage for this analysis on a benchmark that still has findings at the end of the run, compare for each of the 20 runs what the difference at the end of the graphs is, and try to estimate what the timing difference is. That is a lot of work. and only half of the benchmarks actually still have meaningful discoveries at the end.

so - total executions performed for the named benchmarks and fuzzers is the only way to answer this question. all other analysis to answer the question of variance is ..... totally pointless and wrong :) sorry. I am open to arguments of course :)

vanhauser-thc avatar Aug 21 '20 16:08 vanhauser-thc

OK I am done with my analysis and you will not like the results.

I took the data from the google storage bucket with gsutil from gs://fuzzbench-data/2020-08-14/experiment-folders/lcms-2017-03-21-aflplusplus_same*/trial-*/corpus/corpus-archive-0093.tar.gz for the benchmarks that are 100% stable and hence have the exact same random number sequences, seed selection etc. (these are: jsoncpp_jsoncpp_fuzzer,lcms_2017_03_21,libpng_1.2.56,mbedtls_fuzz_dtlsclient,vorbis_2017_12_11,woff2_2016_05_06,zlib_zlib_uncompress_fuzzer)

First Issue - the corpus snapshots are not at the same time. even for *-0093.tar.gz it ranges between 22:59:01 and 22:59:59. That is minor however a very few fuzz runs have their last corpus snapshot at 22:24:20, e.g. gs://fuzzbench-data/2020-08-14/experiment-folders/mbedtls_fuzz_dtlsclient-aflplusplus_same2/trial-305914/corpus/corpus-archive-0091.tar.gz which can easily be seen as it is a -0091 snapshot, but it is the last one in that directory.

Then from every -0093 snapshot I extracted corpus/fuzzer_stats and from there extracted the execs_per_sec. Note that in afl++ this value is the total execs performed divided by the total run time. (in vanilla afl it is over the last 60 seconds).
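A minimal sketch of that extraction step, assuming the archives are mirrored locally (e.g. via gsutil cp -r) and that each snapshot contains corpus/fuzzer_stats as described; the local paths and the parsing below are assumptions:

```python
import re
import statistics
import tarfile
from collections import defaultdict
from pathlib import Path

def stat_from_archive(archive: Path, key: str) -> float:
    # Read a numeric field such as execs_per_sec out of corpus/fuzzer_stats
    # inside one corpus-archive snapshot.
    with tarfile.open(archive, "r:gz") as tar:
        stats = tar.extractfile("corpus/fuzzer_stats").read().decode()
    return float(re.search(rf"{key}\s*:\s*([\d.]+)", stats).group(1))

by_benchmark = defaultdict(list)
for archive in Path("experiment-folders").glob(
        "*-aflplusplus_same*/trial-*/corpus/corpus-archive-0093.tar.gz"):
    folder = archive.parents[2].name              # e.g. "lcms-2017-03-21-aflplusplus_same2"
    benchmark, _, _variant = folder.rpartition("-aflplusplus_")
    by_benchmark[benchmark].append(stat_from_archive(archive, "execs_per_sec"))

# Min / median / max execs_per_sec per benchmark across all same1-3 trials.
for benchmark, values in sorted(by_benchmark.items()):
    print(benchmark, min(values), statistics.median(values), max(values))
```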

well and here it gets ugly. The MIN, MEDIAN and MAX for each of the 60 (59 for json and mbedtls because the snapshot for one of each was below 0093) are: [chart: min/median/max execs_per_sec per benchmark]

And plotted each benchmark as a line with sorted execs_per_sec numbers: [plot: sorted execs_per_sec per trial, one line per benchmark]

These. are. huge. differences!

vanhauser-thc avatar Aug 31 '20 15:08 vanhauser-thc

attached is the raw data that I extracted

fb.txt

vanhauser-thc avatar Aug 31 '20 15:08 vanhauser-thc

OK I am done with my analysis and you will not like the results.

I took the data from the google storage bucket with gsutil from gs://fuzzbench-data/2020-08-14/experiment-folders/lcms-2017-03-21-aflplusplus_same*/trial-*/corpus/corpus-archive-0093.tar.gz for the benchmarks that are 100% stable and hence have the exact same random number sequences, seed selection etc. (these are: jsoncpp_jsoncpp_fuzzer,lcms_2017_03_21,libpng_1.2.56,mbedtls_fuzz_dtlsclient,vorbis_2017_12_11,woff2_2016_05_06,zlib_zlib_uncompress_fuzzer)

First Issue - the corpus snapshots are not at the same time. even for *-0093.tar.gz it ranges between 22:59:01 and 22:59:59. That is minor however a very few fuzz runs have their last corpus snapshot at 22:24:20, e.g. gs://fuzzbench-data/2020-08-14/experiment-folders/mbedtls_fuzz_dtlsclient-aflplusplus_same2/trial-305914/corpus/corpus-archive-0091.tar.gz which can easily be seen as it is a -0091 snapshot, but it is the last one in that directory.

Then from every -0093 snapshot I extracted corpus/fuzzer_stats and from there extracted the execs_per_sec. Note that in afl++ this value is the total execs performed divided by the total run time. (in vanilla afl it is over the last 60 seconds).

We don't use the 93rd.

well and here it gets ugly. The MIN, MEDIAN and MAX for each of the 60 (59 for json and mbedtls because the snapshot for one of each was below 0093) are: [chart: min/median/max execs_per_sec per benchmark]

And plotted each benchmark as a line with sorted execs_per_sec numbers: [plot: sorted execs_per_sec per trial, one line per benchmark]

These. are. huge. differences!

We do 20 trials per fuzzer-benchmark pair. So the difference between the min and the max isn't that important, right? The interesting question is whether aflplusplus_same1 has a different distribution of execs than same2 or same3. In most cases, the median/avg execs per pair are very similar, within 5% of one another. There are some cases like bloaty where the difference is closer to 20%, and I will look into those. As I explained privately when I shared how to get this data, I don't have time to look into/analyze this right now (I am moving out today and driving across the country at the end of the week). I'll investigate more and share my results when I get the chance.
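One way to test that question directly, once per-trial exec counts are extracted as above, would be a Kruskal-Wallis test across the three variants; a sketch with made-up numbers, not part of the thread's analysis:

```python
# Do same1/same2/same3 draw their total executions from the same distribution?
from scipy.stats import kruskal

same1 = [1.86e9, 1.84e9, 1.79e9, 1.81e9, 1.83e9]   # hypothetical execs_done totals
same2 = [1.52e9, 1.49e9, 1.55e9, 1.50e9, 1.48e9]
same3 = [1.49e9, 1.47e9, 1.51e9, 1.48e9, 1.50e9]

stat, p = kruskal(same1, same2, same3)
print(f"H={stat:.2f}, p={p:.4f}")   # p < 0.05 would suggest the exec distributions differ
```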

jonathanmetzman avatar Aug 31 '20 16:08 jonathanmetzman

@jonathanmetzman sadly no, the medians for same1, same2 and same3 are (I took the 11th entry after sorting the 20 entries):

same1: 22490.07
same2: 17949.59
same3: 18934.58

or in other words, all executions performed in same1 total 1861353704, whereas in same3 the total is 1485660644 --- that is a 20% difference. This data is from zlib, as it has the widest variance in the graph.
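(As a check: (1861353704 - 1485660644) / 1861353704 ≈ 0.202, so same3 performed roughly 20% fewer executions than same1 in the same wall-clock budget.)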

vanhauser-thc avatar Aug 31 '20 16:08 vanhauser-thc

Yes I'm aware that the difference is ~20% in the worst cases. As I said:

There are some cases like bloaty where the difference is closer to 20% and I will look into those.

Those are interesting and I will look into them. They are bad, since one of the use cases I like most for fuzzbench is comparing two fuzzers on a single benchmark and improving one. But it still seems to me that the ranking based on the results of 21 benchmarks (where on average the execs were within 5% of each other) is sound.

jonathanmetzman avatar Aug 31 '20 16:08 jonathanmetzman

@all sorry for writing this already when there are few resources at fuzzbench to talk about this now ... overlooked the message :(

I think the impact on the overall ranking is there but small. it affects those benchmarks which still make progress in the last 2 hours of the run, but not the others.

maybe for those where this is the case the number of runs could be increased to 40, and for the others reduced to 10.

vanhauser-thc avatar Aug 31 '20 17:08 vanhauser-thc