Statistical hypothesis testing
The statistics for determining whether benchmark results actually differ significantly aren't super intuitive. It would be nice if benchmark.zig handled this for you.
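To make the idea concrete, here's a minimal sketch of the kind of test this could mean: Welch's t-test on two sets of timing samples, which doesn't assume equal variances or equal sample counts. Nothing here is benchmark.zig API; `Stats` and `welchT` are made-up names, the 1.96 cutoff is a normal approximation rather than a proper t distribution, and it assumes Zig 0.12+ for `@abs`:

```zig
const std = @import("std");

/// Summary statistics for one benchmark's timing samples.
const Stats = struct {
    mean: f64,
    variance: f64, // sample variance (Bessel-corrected, needs >= 2 samples)
    n: f64,

    fn compute(samples: []const f64) Stats {
        var sum: f64 = 0;
        for (samples) |x| sum += x;
        const mean = sum / @as(f64, @floatFromInt(samples.len));
        var ss: f64 = 0;
        for (samples) |x| {
            const d = x - mean;
            ss += d * d;
        }
        return .{
            .mean = mean,
            .variance = ss / @as(f64, @floatFromInt(samples.len - 1)),
            .n = @floatFromInt(samples.len),
        };
    }
};

/// Welch's t statistic for the null hypothesis "both benchmarks have the
/// same mean".
fn welchT(a: Stats, b: Stats) f64 {
    const standard_error = @sqrt(a.variance / a.n + b.variance / b.n);
    return (a.mean - b.mean) / standard_error;
}

pub fn main() void {
    // Hypothetical timings (ns) from two runs of the same benchmark.
    const before = [_]f64{ 102, 98, 101, 99, 103, 100, 97, 101 };
    const after = [_]f64{ 91, 94, 90, 93, 92, 95, 89, 92 };

    const t = welchT(Stats.compute(&before), Stats.compute(&after));
    // Normal approximation: |t| > ~1.96 rejects the null at alpha = 0.05.
    // A real implementation would use the t distribution with
    // Welch-Satterthwaite degrees of freedom, especially for small n.
    std.debug.print("t = {d:.2}, significant: {}\n", .{ t, @abs(t) > 1.96 });
}
```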
I'm unsure exactly how the UX for this should work. Ideally you want to be able to compare any two benchmarks to see whether they're statistically different from each other, but I'm hesitant to add an interactive TUI since that would significantly complicate the library. Maybe it could go in a standalone analyzer binary that reads the results from a CSV or something? The simplest approach is probably to just compare every pair of benchmarks and output a summary, but that's potentially very noisy, particularly for benchmarks that are unrelated to one another. Another option is to output a webpage showing the results, with a little JS to do the stats for whatever null hypothesis the user wants, which would also open the door to distribution graphs etc.
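For the pairwise-summary option, the core loop is small either way. A sketch working from per-benchmark summary stats (`Result` and `printPairwiseSummary` are hypothetical names, and the 1.96 cutoff is the same normal approximation as above):

```zig
const std = @import("std");

const Result = struct { name: []const u8, mean: f64, variance: f64, n: f64 };

/// Compare every pair of results and print the ones that differ
/// significantly, using Welch's t with a normal approximation.
fn printPairwiseSummary(results: []const Result) void {
    for (results, 0..) |a, i| {
        for (results[i + 1 ..]) |b| {
            const se = @sqrt(a.variance / a.n + b.variance / b.n);
            const t = (a.mean - b.mean) / se;
            if (@abs(t) > 1.96) {
                std.debug.print("{s} vs {s}: t = {d:.2} (significant)\n", .{ a.name, b.name, t });
            }
        }
    }
}

pub fn main() void {
    // Made-up summary stats; in practice these would come from the runner.
    const results = [_]Result{
        .{ .name = "hash/fnv", .mean = 100, .variance = 9, .n = 64 },
        .{ .name = "hash/wyhash", .mean = 80, .variance = 16, .n = 64 },
        .{ .name = "hash/crc32", .mean = 81, .variance = 25, .n = 64 },
    };
    printPairwiseSummary(&results);
}
```

This is also where the noise problem shows up: with k benchmarks there are k(k-1)/2 comparisons, so a fixed 0.05 threshold will flag spurious differences as the suite grows.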
One common use case is "I changed a thing, is it faster now?" In that case, the goal is to compare each benchmark to the matching result from a previous run. Simply being able to output results to a CSV file, and compare against one, might be sufficient for most use cases.
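A sketch of the output side, assuming Zig 0.13-era std.fs APIs; the `writeCsv` name and column layout are made up, but a name plus mean, variance, and sample count per row is enough for a later run to find the matching row and run the same test against it:

```zig
const std = @import("std");

const Result = struct { name: []const u8, mean_ns: f64, variance: f64, samples: u32 };

/// Write one row per benchmark; a later run reads the file back and tests
/// each benchmark against the row with the matching name.
fn writeCsv(path: []const u8, results: []const Result) !void {
    const file = try std.fs.cwd().createFile(path, .{});
    defer file.close();
    const w = file.writer();
    try w.writeAll("name,mean_ns,variance,samples\n");
    for (results) |r| {
        try w.print("{s},{d},{d},{d}\n", .{ r.name, r.mean_ns, r.variance, r.samples });
    }
}

pub fn main() !void {
    // Hypothetical results for one benchmark.
    const results = [_]Result{
        .{ .name = "parse/json", .mean_ns = 1520.4, .variance = 88.1, .samples = 128 },
    };
    try writeCsv("bench-results.csv", &results);
}
```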