Null hypothesis test discrepancy
https://github.com/google/benchmark/blob/49aa374da96199d64fd3de9673b6f405bbc3de3e/tools/gbench/report.py#L212
As far as I understand, in the google benchmark compare.py script we set:
- The null hypothesis to be that the baseline and the contender have the same results.
- The alternative hypothesis to be the contrary, that they are different.
For a null hypothesis test we expect:
The null hypothesis to be rejected when the p-value of the contender results under the baseline distribution is less than the alpha value (alpha = 1 - confidence level).
However, in the source code we determine that the results are different when p-value >= alpha. This is the reverse of what we are looking for.
On a related point about performance, we want a one-sided hypothesis test, since getting less execution time than the baseline should not be considered a failure.
I wonder if I am missing something here. I also wonder why we use mannwhitneyu instead of a normal-distribution-based test, since mannwhitneyu is meant for small sample sizes, whereas with a larger N a normal-distribution-based test should work.
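For reference, here is a minimal sketch of the kind of check I am describing (hypothetical numbers, not the actual report.py code; it assumes scipy is installed):

```python
# Minimal sketch of the U-test decision being discussed (hypothetical
# numbers, not the actual report.py code). Requires scipy.
from scipy.stats import mannwhitneyu

alpha = 0.05  # significance level, i.e. 1 - confidence level

# Hypothetical per-repetition timings for the same benchmark.
baseline_times  = [10.1, 10.3, 9.9, 10.2, 10.0, 10.1, 9.8, 10.2, 10.0]
contender_times = [10.2, 10.0, 10.1, 9.8, 10.3, 10.0, 10.2, 9.9, 10.1]

# H0: both samples come from the same distribution.
# A one-sided test would instead pass alternative='greater' or 'less'.
_, p_value = mannwhitneyu(baseline_times, contender_times, alternative='two-sided')

# Conventional decision rule: reject H0 ("the results differ")
# only when p_value < alpha.
results_differ = p_value < alpha
print(f"p-value = {p_value:.4f}, results differ: {results_differ}")
```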
Presumably you have read #593?
@LebedevRI thanks for the prompt reply. I just went through the thread, great read. And I agree with your initial idea of using tdist. However, it still does not answer why in the code we reject the null hypothesis on pval >= alpha. Even the command help string says:
if the calculated p-value is below this value, then the result is said to be statistically significant and the null hypothesis is rejected. (default: 0.0500)
Note that I am not an expert in data analysis and probability statistics; I just want to understand this.
That test is there to determine whether or not we can confidently say that two measurement sets aren't actually the same measurement set. We get a large p-value when we cannot say that, and a small p-value when we can. So we want a small p-value, and that is exactly what that line does.
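To illustrate with a rough example (hypothetical numbers, not output from the tool itself):

```python
# Rough illustration (hypothetical numbers, requires scipy): the p-value is
# large when the two samples could plausibly be the same measurement set,
# and small when they clearly are not.
from scipy.stats import mannwhitneyu

baseline = [10.0, 10.1, 9.9, 10.2, 10.0, 10.1, 9.8, 10.3, 10.0, 10.1]

# Contender drawn from roughly the same distribution -> large p-value.
same = [10.1, 9.9, 10.0, 10.2, 10.1, 10.0, 9.9, 10.2, 10.1, 10.0]
print(mannwhitneyu(baseline, same, alternative='two-sided').pvalue)

# Contender clearly slower -> small p-value, the difference is significant.
slower = [12.0, 12.1, 11.9, 12.2, 12.0, 12.1, 11.8, 12.3, 12.0, 12.1]
print(mannwhitneyu(baseline, slower, alternative='two-sided').pvalue)
```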
@LebedevRI
Here is where I am confused: we set the line to red (failure) when the p-value is large (results are the same) and to green when the results are different. It is counter-intuitive, since if your benchmarks are much slower or much faster you will then get a green result. I think that green should mean that the results are statistically similar to the baseline.
Here is where I am confused: we set the line to red (failure) when the p-value is large (results are the same) and to green when the results are different. It is counter-intuitive, since if your benchmarks are much slower or much faster you will then get a green result.
Correct. That's the whole point.
I think that green should mean that the results are statistically similar to the baseline.
Do you believe the documentation is backwards to what we do?
Do you believe the documentation is backwards to what we do?
If there is a sufficient repetition count of the benchmarks, the tool can do a U Test, of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
If the calculated p-value is lower than the significance level alpha, then the result is said to be statistically significant and the null hypothesis is rejected, which in other words means that the two benchmarks aren't identical.
- The first paragraph says that: H0: Sx == Sy
- The second paragraph says that: if p < alpha then H0 is rejected
- The second paragraph then says that: H0 rejected means that Sx != Sy.
After reading this and running the script, a user might deduce that the red color means "H0 rejected" and that green means "H0 not rejected". For a new user it is not unreasonable to expect: green -> no change; red -> change.
While we can discuss which color is best for which, I think it is worth noting the color meanings in the documentation.
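For what it is worth, the mapping as I now understand it from this thread is roughly the following (a simplified sketch, not the actual report.py coloring code):

```python
# Simplified sketch of the current behaviour as described in this thread
# (not the actual report.py code): green marks a statistically significant
# difference, red marks "cannot distinguish the two samples".
def utest_result_color(p_value: float, alpha: float = 0.05) -> str:
    # H0: baseline and contender come from the same distribution.
    if p_value < alpha:
        return "green"  # H0 rejected: the two runs measurably differ
    return "red"        # H0 not rejected: no evidence of a difference
```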
Not really sure where it would be reasonable to put that footnote.
@LebedevRI probably before the last sentence at https://github.com/google/benchmark/blob/main/docs/tools.md
Patches welcomed?
Patches welcomed?
Is this a question or a statement?
A statement
Just to condense: if someone wants to improve the docs, please feel free to submit a PR.
Hi @LebedevRI, can I do this task?
I have a general idea of what this is about. I even read #593 and am now looking into tools.md for a better understanding.
Sure. Here, only the documentation needs to be enhanced.
Hi @LebedevRI,
I have created PR 1624 to enhance the tools.md documentation file.
If there is anything else to improve/add or if I missed something, do let me know.
Regards.
Many thanks to @varshneydevansh, the newly added docs in your PR look great; you did a good job explaining how to understand the compare.py output and what it does. Very clarifying. :) And thanks to @LebedevRI for keeping this open and reviewing the PR. :clap: :clap:
Nice!
Vicente, thank you. I am acquiring knowledge on the fly, thanks to the guidance provided by Roman.😄