fuzzbench Maybe replace scipy.stats.mannwhitneyu?

I don't know if this issue was considered when implementing the analysis module, but using scipy.stats to compute Mann-Whitney U is not ideal.

I first noticed the warning on the documentation page, not to use with sample sizes less than 20. I then found the open pull request that seems to be completely abandoned and a warning from one of the maintainers that "the current implementation is pretty badly broken and needs a complete rewrite I'm afraid".

There's a lot of controversy about using null-hypothesis significance tests in general, but I'm not attempting to bring in that debate. I just wanted to suggest that if Fuzzbench is using Mann-Whitney U, then it should use an implementation that computes p-values with the exact distribution since the sample sizes are generally <= 20.

Unfortunately I don't have a good suggestion as a replacement. I thought the wilcox.test in R would be good replacement, but even that implementation only uses the exact method when the sample sizes are less than 50 and there are no duplicates anywhere in the data. I would expect most data sets to have duplicates, either from one fuzzer getting the same coverage in two trials or from different fuzzers achieving the exact same coverage.

The Vargha Delaney A measure has been used and recommended in fuzzing research, but doesn't really replace Mann-Whitney as a statistical test concerning the distributions.

Does this matter for Fuzzbench data? I did some quick analysis on 2020-11-01 experiment just to check Scipy versus R. I found a couple changes in the significance level around the 0.01 level, but I did not find changes at the 0.05 level.

Here's one example below. With scipy.stats, the significance level is 0.05 since the p-value is greater than 0.01, but with the exact method in R, the p-value is 0.009. There's certainly not a large delta between the values, but changes the significance level due to the cutoff values. In this instance, there are no duplicate values, so R is using the exact method to determine the p-value.

first compute using Scipy.stats

>>> df = exp_snapshot_df
>>> app_fast_v2 = sorted(df[(df.fuzzer =='aflplusplus_fast_v2') & (df.benchmark == 'freetype2-2017')].edges_covered)
>>> app = sorted(df[(df.fuzzer =='aflplusplus') & (df.benchmark == 'freetype2-2017')].edges_covered)
>>> app_fast_v2
[25226, 25648, 25783, 25936, 26108, 26227, 26261, 26325, 26425, 26434, 26563, 26598, 26644, 26689, 26715, 26932, 27075, 27089, 27173, 27511]
>>> app
[24970, 25301, 25323, 25563, 25578, 25617, 25721, 25779, 25824, 25856, 25910, 26052, 26210, 26304, 26310, 26463, 26590, 26629, 26840, 26950]
>>> import scipy.stats as ss
>>> ss.mannwhitneyu(app_fast_v2, app, alternative='two-sided')
MannwhitneyuResult(statistic=295.0, pvalue=0.010581211443165648)

Now compute MW in R.

> app_fast_v2 = c(25226, 25648, 25783, 25936, 26108, 26227, 26261, 26325, 26425, 26434, 26563, 26598, 26644, 26689, 26715, 26932, 27075, 27089, 27173, 27511)
> app = c(24970, 25301, 25323, 25563, 25578, 25617, 25721, 25779, 25824, 25856, 25910, 26052, 26210, 26304, 26310, 26463, 26590, 26629, 26840, 26950)
> wilcox.test(app_fast_v2, app, alternative="two.sided", exact=TRUE)

        Wilcoxon rank sum test

data:  app and app_fast_v2
W = 105, p-value = 0.009484
alternative hypothesis: true location shift is not equal to 0

Dec 04 '20 22:12 wideglide

@wideglide Just wanted to let you know that I am reviving the SciPy PR you mentioned. Hopefully, it will be ready for SciPy 1.7.0, but that won't be available until the summer. Whenever it finishes, it will be nicely vectorized and it will do the exact p-value calculation, but I'm not sure if that will be valid for ties. If you find an implementation that says it does an exact test that's corrected for ties, please let me know.

Dec 14 '20 00:12 mdhaber

Does this matter for Fuzzbench data? I did some quick analysis on 2020-11-01 experiment just to check Scipy versus R. I found a couple changes in the significance level around the 0.01 level, but I did not find changes at the 0.05 level.

Thanks @wideglide for the analysis and @mdhaber for the heads up! Yes, a bit better implementation of Mann-Whitney U would be appreciated.

Dec 14 '20 01:12 lszekeres

gh-4933 was merged, so the new and improved mannwhitneyu should be released with SciPy 1.7 (probably late June-ish). I'd appreciate your thoughts if you have a chance to check it out before release!

It does not consider ties in the exact null distribution, but it will produce p-values based on the exact null distribution derived without ties. These should be conservative (increased Type II error, reduce Type I error) because ties tend to reduce the variance (see Hollander and Wolfe). For (very) small samples, gh-13899 would be a way to get tie-corrected exact p-values, and for larger samples the asymptotic approximation should be a good approximation. Maybe someday we can implement an exact null distribution considering ties for MWU, but for this PR I just wanted to fix the bugs and respond to some other feature requests.

May 11 '21 19:05 mdhaber

app = [24970, 25301, 25323, 25563, 25578, 25617, 25721, 25779, 25824, 25856, 25910, 26052, 26210, 26304, 26310, 26463, 26590, 26629, 26840, 26950]

app_fast_v2 = [25226, 25648, 25783, 25936, 26108, 26227, 26261, 26325, 26425, 26434, 26563, 26598, 26644, 26689, 26715, 26932, 27075, 27089, 27173, 27511]

from scipy.stats import mannwhitneyu

mannwhitneyu(app_fast_v2, app, alternative='two-sided')
Out[4]: MannwhitneyuResult(statistic=295.0, pvalue=0.010581211443165648)

mannwhitneyu(app_fast_v2, app, method='exact', alternative='two-sided')
Out[5]: MannwhitneyuResult(statistic=295.0, pvalue=0.009483607191205003)

The documentation explains how you can get the other U-statistic (105) from the statistic the function returns (for backwards compatibility) and the size of the samples.

May 11 '21 19:05 mdhaber

fuzzbench fuzzbench copied to clipboard

Maybe replace scipy.stats.mannwhitneyu?

fuzzbench
fuzzbench copied to clipboard