
Null hypothesis test discrepancy

Open vicentebolea opened this issue 3 years ago • 9 comments

https://github.com/google/benchmark/blob/49aa374da96199d64fd3de9673b6f405bbc3de3e/tools/gbench/report.py#L212

As far as I understand, in the google benchmark compare.py script we set:

  • The null hypothesis: that the baseline and the contender have the same results.
  • The alternative hypothesis: the contrary, that they are different.

For a null hypothesis test we expect:

the null hypothesis to be rejected when the p-value of the contender results, evaluated against the baseline distribution, is less than the alpha value (alpha = 1 - confidence level).

However, in the source code we determine that the results are different when pvalue >= alpha. This is the reverse of what we are looking for.

On a related question about the performance level, we would want a one-sided hypothesis test, since getting less execution time than the average should not be considered a failure.

I wonder if I am missing something here. I also wonder why mannwhitneyu is used instead of a test based on the normal distribution, since mannwhitneyu is mainly suited to small sample sizes rather than a larger N.
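
To make the p-value/alpha decision rule concrete, here is a minimal sketch (made-up timings, not the actual compare.py code) of how scipy's mannwhitneyu could be applied to two sets of benchmark results, including the one-sided variant mentioned above:

    # Minimal sketch with made-up timings; not the actual compare.py code.
    from scipy.stats import mannwhitneyu

    baseline  = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 10.1, 10.2, 10.3]
    contender = [ 9.1,  9.3, 8.9,  9.2,  9.0,  9.4,  9.1,  9.2,  9.3]

    alpha = 0.05  # significance level (1 - confidence level)

    # Two-sided test: H0 = both samples come from the same distribution.
    _, pvalue = mannwhitneyu(baseline, contender, alternative='two-sided')
    if pvalue < alpha:
        print("H0 rejected: baseline and contender differ (p = %.4f)" % pvalue)
    else:
        print("cannot reject H0: no significant difference (p = %.4f)" % pvalue)

    # One-sided variant that only flags regressions, i.e. tests whether the
    # contender tends to be slower (larger times) than the baseline.
    _, p_regression = mannwhitneyu(baseline, contender, alternative='less')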

vicentebolea avatar Sep 20 '22 16:09 vicentebolea

Presumably you have read #593?

LebedevRI avatar Sep 20 '22 19:09 LebedevRI

@LebedevRI thanks for the prompt reply. I just went through the thread, great read. And I agree with your initial idea of using tdist. However, it still does not answer why in the code we reject the null hypothesis on pval >= alpha. Even the command help string says:

 if the calculated p-value is
                        below this value, then the result is said to be
                        statistically significant and the null hypothesis is
                        rejected. (default: 0.0500)

Note that I am not an expert in data analysis and probability statistics, I just want to understand this.

vicentebolea avatar Sep 20 '22 19:09 vicentebolea

That test is there to determine whether or not we can confidently say that two measurement sets aren't actually the same measurement set. We get a large p-value when we cannot say that, and a small p-value when we can. So we want a small p-value, and that is exactly what that line does.
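
As a toy illustration of that behaviour (synthetic data, not part of google/benchmark): two samples drawn from the same distribution typically give a large p-value, while samples from clearly different distributions give a small one.

    # Toy illustration with synthetic data; not part of google/benchmark.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    a      = rng.normal(loc=10.0, scale=0.5, size=30)
    b_same = rng.normal(loc=10.0, scale=0.5, size=30)  # same distribution as a
    b_diff = rng.normal(loc=12.0, scale=0.5, size=30)  # clearly different

    _, p_same = mannwhitneyu(a, b_same, alternative='two-sided')
    _, p_diff = mannwhitneyu(a, b_diff, alternative='two-sided')

    print("same distribution:      p = %.3f  (large: cannot tell them apart)" % p_same)
    print("different distribution: p = %.2e  (small: confidently different)" % p_diff)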

LebedevRI avatar Sep 20 '22 20:09 LebedevRI

@LebedevRI

Here is where I am confused: we set the line as red (failure) when the p-value is large (results are the same) and green when the results are different. It is counter-intuitive, since if your benchmark is much slower or much faster you will still get a green result. I think that green should mean that the results are statistically similar to the baseline.

vicentebolea avatar Sep 20 '22 20:09 vicentebolea

Here is where I am confused: we set the line as red (failure) when the p-value is large (results are the same) and green when the results are different. It is counter-intuitive, since if your benchmark is much slower or much faster you will still get a green result.

Correct. That's the whole point.

I think that green should mean that the results are statistically similar to the baseline.

Do you believe the documentation is backwards to what we do?

LebedevRI avatar Sep 20 '22 20:09 LebedevRI

Do you believe the documentation is backwards to what we do?

If there is a sufficient repetition count of the benchmarks, the tool can do a U Test, of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

If the calculated p-value is lower than the significance level alpha, then the result is said to be statistically significant and the null hypothesis is rejected, which in other words means that the two benchmarks aren't identical.

  • The first paragraph says that H0: Sx == Sy.
  • The second paragraph says that if p < alpha, then H0 is rejected.
  • The second paragraph finally says that H0 rejected means Sx != Sy.

After reading this and running the script, a user might deduce that the red color means "H0 rejected" and that green means "H0 not rejected". For a new user it is not unreasonable to expect: green -> no change; red -> change.

While we can discuss which color is best for which, I think it is worth noting the color meanings in the documentation.
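
To spell out the color semantics discussed above, here is a hypothetical helper (an interpretation sketch, not the actual report.py code) describing what the output colors mean:

    # Hypothetical helper; a sketch of the interpretation discussed in this
    # thread, not the actual report.py code.
    def describe_utest_color(pvalue, alpha=0.05):
        if pvalue < alpha:
            # H0 rejected: the two measurement sets are statistically
            # different -- printed in green.
            return "green: H0 rejected, baseline and contender differ"
        # H0 not rejected: we cannot confidently say the sets differ --
        # printed in red.
        return "red: cannot reject H0, no statistically significant difference"

    print(describe_utest_color(0.01))  # green
    print(describe_utest_color(0.42))  # red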

vicentebolea avatar Sep 20 '22 20:09 vicentebolea

Not really sure where it would be reasonable to put that footnote.

LebedevRI avatar Sep 20 '22 21:09 LebedevRI

@LebedevRI probably before the last sentence at https://github.com/google/benchmark/blob/main/docs/tools.md

vicentebolea avatar Sep 21 '22 01:09 vicentebolea

Patches welcomed?

LebedevRI avatar Sep 21 '22 12:09 LebedevRI

Patches welcomed?

Is this a question or a statement?

vicentebolea avatar Sep 21 '22 16:09 vicentebolea

A statement

LebedevRI avatar Sep 21 '22 20:09 LebedevRI

Just to condense: if someone wants to improve the docs, please feel free to submit the PR.

LebedevRI avatar Oct 25 '22 22:10 LebedevRI

Hi @LebedevRI, can I do this task?

I have a general idea of what this is about. I even read #593 and am now looking into tools.md for a better understanding.

varshneydevansh avatar Jul 02 '23 06:07 varshneydevansh

Sure. Here, only the documentation needs to be enhanced.

LebedevRI avatar Jul 02 '23 11:07 LebedevRI

Hi @LebedevRI,

I have created PR 1624 to enhance the tools.md documentation file.

If there is anything else to improve/add or if I missed something, do let me know.

Regards.

varshneydevansh avatar Jul 06 '23 23:07 varshneydevansh

Many thanks to @varshneydevansh, the newly added docs in your PR look great; you did a good job explaining how to understand the compare.py output and what it does. Very clarifying. :) And thanks to @LebedevRI for keeping this open and reviewing the PR. :clap: :clap:

vicentebolea avatar Jul 10 '23 16:07 vicentebolea

Nice!

LebedevRI avatar Jul 10 '23 16:07 LebedevRI

Vicente, thank you. I am acquiring knowledge on the fly, thanks to the guidance provided by Roman.😄

varshneydevansh avatar Jul 10 '23 17:07 varshneydevansh