Benchmarks: Aggregate results
For each suite, we should do something akin to Clickbench, which computes a shifted ratio for each query:
(new_value + 10ms) / (baseline_value + 10ms)
Notably, the baseline value in Clickbench is the fastest time for that single query across all engines; here we would instead use the value from develop as baseline_value and the value from the PR as new_value.
We can then aggregate each engine-suite pair (e.g., TPC-H on NVMe with duckdb) in the same way as Clickbench, which takes the geometric mean of those ratios.
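A minimal sketch of that aggregation, assuming per-query runtimes in milliseconds are available as plain dicts keyed by query name. The function names and the dict layout are hypothetical and only illustrate the formula; they are not taken from our benchmark tooling.

```python
import math

SHIFT_MS = 10.0  # the 10ms shift from the ratio formula above


def shifted_ratio(new_value, baseline_value, shift=SHIFT_MS):
    """(new_value + 10ms) / (baseline_value + 10ms) for a single query."""
    return (new_value + shift) / (baseline_value + shift)


def geometric_mean(values):
    """Geometric mean, computed via logs to avoid overflow on long lists."""
    values = list(values)
    if not values:
        raise ValueError("geometric mean of an empty sequence is undefined")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate_pair(pr_ms, develop_ms):
    """Score for one engine-suite pair.

    pr_ms / develop_ms: hypothetical dicts mapping query name -> runtime in ms.
    Only queries present in both runs are compared.
    """
    common = sorted(pr_ms.keys() & develop_ms.keys())
    return geometric_mean(shifted_ratio(pr_ms[q], develop_ms[q]) for q in common)


if __name__ == "__main__":
    develop = {"q1": 40.0, "q2": 90.0}
    pr = {"q1": 90.0, "q2": 40.0}
    print(aggregate_pair(pr, develop))
```

A score above 1 means the PR is slower than develop on that engine-suite pair, and a score below 1 means it is faster. The shift keeps very short queries, where noise dominates, from swinging the ratio.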
Alternatively (and much easier since GH does the diffing vs baseline), we could also just take the geometric mean of query runtimes per engine-suite pair.
We have added that recently.