Improve `queries` tool
Key changes in this pull request:
- Unifies the functionality of `op_perftest()` and the `--extract` option, so the latter has been removed. The former `op_perftest()` behavior is now available via the new `--summary-only` option.
- Adds a new `--runs` option to specify the number of runs used to measure the query set (by default: 3). Note that this parameter excludes warmup.
- Adds a new `--aggregate-by` option with the following modes: `none`, `min`, `mean`, `median` and `max`. When aggregation is `none` (or not specified), all query results across the runs are exported, also printing the stats for all the aggregation modes.
- Writes the statistics summary to `std::cerr`, allowing the per-query results CSV to be exported by simply using `./queries <parameters> > per_query_times.csv`.
- Measures each query once per run, as with the former `op_perftest()`, so the set of queries is evaluated independently in each run.
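For illustration, a minimal sketch of the stdout/stderr split described above; both helper names are hypothetical, not functions from this PR:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Per-query rows go to stdout, so `./queries ... > per_query_times.csv`
// captures only the data (hypothetical helper).
void print_result_row(std::string const& query_id, std::size_t run, double time_ms) {
    std::cout << query_id << ',' << run << ',' << time_ms << '\n';
}

// The statistics summary goes to stderr, so it stays on the terminal even
// when stdout is redirected (hypothetical helper).
void print_summary_line(std::string const& label, double value) {
    std::cerr << label << ": " << value << '\n';
}
```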
@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return min for each query instead of R results where R is num of runs?
I think this could be reasonable, and I do think this was what I had initially envisaged. But I can see the benefit of extracting everything to stderr, and then also allowing a separate "aggregated" stream to either a file or to stdout? I agree that if there is an aggregated stream being output, then a summary should also use that same aggregation. Maybe we can make a new "results" format where we dump both file.agg.summary and file.agg.stream?
I would actually try to avoid introducing new formats; I'd like to keep this as simple as possible, and as unsurprising as possible as well.
I think the most important thing is to give the user the raw data, which they can process however they want. I think we all agree on this part.
Then, I would lean towards simplicity:
- I think having a single type of output (data vs summary) is much less ambiguous, so I lean towards having either one or the other. I'm thinking of the summary as a convenience during prototyping rather than actual serious data gathering.
- Having the ability to aggregate for a query may be crucial for summary, and maybe useful for extracting data.
- I think we should think hard about what agg functions make sense, and not include those that don't. I'm struggling to find a use for max or median. I think min and mean make sense.
We will not support all types of outputs from this tool, but that's OK. This is why we give the user the raw output.
Given the above, I would suggest the following algorithm:
```
results = run_benchmark(...)
if (aggregate) {
    results = aggregate_per_query(results, agg_function)
}
if (summarize) {
    print_summary(results)
} else {
    print_data(results)
}
```
You have a few choices for implementing this.
One is having results as "rows", i.e., a vector of structs describing everything you need, including the query ID; then aggregate_per_query can internally do a group-by (it could just use a hash map) and aggregate.
The other would be to have results as nested vectors; then, after aggregation, you still get nested vectors, only now each inner vector has one element.
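For illustration, a minimal sketch of the first ("rows") option, assuming a hypothetical `QueryResult` struct and a caller-supplied aggregation function:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One row per (query, run) measurement -- hypothetical layout.
struct QueryResult {
    std::string query_id;
    std::size_t run;
    double time_ms;
};

// Group rows by query ID (hash-map group-by), then reduce each group's
// times to a single value with `agg`.
template <typename AggFn>
std::vector<QueryResult>
aggregate_per_query(std::vector<QueryResult> const& results, AggFn agg) {
    std::unordered_map<std::string, std::vector<double>> groups;
    for (auto const& row : results) {
        groups[row.query_id].push_back(row.time_ms);
    }
    std::vector<QueryResult> aggregated;
    for (auto const& [query_id, times] : groups) {
        aggregated.push_back({query_id, 0, agg(times)});
    }
    return aggregated;
}
```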
agg_function can take multiple results and output multiple results as well; this way, you can treat no-aggregate as a type of aggregate (identity) -- maybe "transform" is a better name? I would prefer creating that transform function first (from the CLI options) and then passing it in and applying it to each query group, as opposed to passing a name and having all the if-else conditions nested inside. It would be much cleaner that way, but it requires some design consideration.
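A sketch of that idea; `Transform` and `make_transform` are illustrative names, not the tool's API. The `none` mode is just the identity, so the caller applies the same transform unconditionally:

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// A transform maps all measured times of one query to the times to report.
using Transform = std::function<std::vector<double>(std::vector<double>)>;

Transform make_transform(std::string const& name) {
    if (name == "none") {
        // Identity: report every run's measurement.
        return [](std::vector<double> times) { return times; };
    }
    if (name == "min") {
        return [](std::vector<double> times) {
            return std::vector<double>{*std::min_element(times.begin(), times.end())};
        };
    }
    if (name == "mean") {
        return [](std::vector<double> times) {
            auto mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
            return std::vector<double>{mean};
        };
    }
    throw std::invalid_argument("unknown aggregation: " + name);
}
```

This keeps the if-else chain in one factory function and out of the measurement loop.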
print_data and print_summary don't need to know anything about how results were produced or transformed, only how to print them (or calculate stats).
I think that, in this case, the median could provide more robustness than the mean; for example, by suppressing atypical cases or noise (mostly related to the maximum values across the runs).
As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I just included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed.
Regarding the methodology for printing data or a summary, I think it is useful to show the summary when extracting. In this case, if some of the metrics satisfy the user's needs, there is no need to run an external script (for that reason, I think it is useful to print all defined metrics when no aggregation/transformation is specified).
Also, given that the query times are printed to the output, they can simply be exported using redirection (>). If that is confusing, a file path could be used instead, but I think redirection keeps it simple. In the case of printing just the summary (for example, for a quick check of a prototype), I agree that it does not make sense to print the query times.
> I think that, in this case, the median could provide more robustness than the mean; for example, by suppressing atypical cases or noise (mostly related to the maximum values across the runs).
That's fair.
My main concern is making it too complex. How about we always extract all queries (user can process that data themselves) and always print all summaries?
I really don't want to go down the route of defining an aggregate function that will only apply to one or the other.
Just note that if you use stderr for summaries, we can't pipe it to another tool for transformation, because the redirect will also capture the rest of the logs, so it will be purely informative.
The behavior in which all runs (together with all summaries) are extracted occurs when aggregation is set to `none` (or not specified): in this case, the tool shows the summary with the runs aggregated by `min`, `mean`, etc.
However, another experiment could be: "I just want to know what happens when all values are the minimum." Although I can obtain this value from the summary output, if I specify aggregation by `min`, it makes sense for the summary and the output to adapt to that scenario, so I don't need to write a specific script to reprocess the output data (even though I understand such a script would be simple). This can be useful for quick experiments: if I want to understand the causes behind this value, I can quickly analyze the file that already contains all the minimum values.
In any case, I understand that this may introduce unnecessary complexity from an SRP perspective, and an intermediate option would be to remove the `--aggregate-by` parameter and always show all the data with the summary for the different values, as mentioned.
What do you think @elshize, @JMMackenzie?
I personally think it's unnecessary, but if you really want to aggregate only the summary, this needs to be named explicitly, in a way that leaves no doubt as to what it does. `--aggregate-by` is not good enough. `--summarize-by`, maybe? Not sure, but it must be explicit about what we're aggregating.
> As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I just included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed.
Maybe we should actually run a bunch of experiments and see if it actually matters? :grin:
By the way, if we want the summary to aggregate, why not show them all?
```
Per-Query Minimums
--------------------------------
> Summary data here

Per-Query Medians
--------------------------------
> Summary data here

...
```
I guess the output would be verbose... I am usually happy to report something sensible as the default though, so we could have a median or mean summary by default, and then a `--verbose-summary` flag that does them all?
> By the way, if we want the summary to aggregate, why not show them all?
My understanding is that this is what @gustingonzalez suggests to do -- by default. I would say that if this is the default, then I see very little value in an additional filter to only show one.
I think we can probably come up with a succinct way of printing them out; per aggregate function, we only need mean and quantiles:
```
Encoding: block_simdbp
Algorithm: block_wand_max
Corrective reruns: 7
All runs: [mean=1000, q50=1000, q90=2000, q99=3000]
Per query medians: [mean=1000, q50=1000, q90=2000, q99=3000]
Per query means: [mean=1000, q50=1000, q90=2000, q99=3000]
Per query minimums: [mean=1000, q50=1000, q90=2000, q99=3000]
...
```
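For what it's worth, a minimal sketch of producing one such line, assuming nearest-rank quantiles over the measured times (`print_stats_line` is an illustrative name, not part of the tool):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Print one "label: [mean=..., q50=..., q90=..., q99=...]" line.
// Assumes `times` is non-empty; takes a copy because it sorts in place.
void print_stats_line(char const* label, std::vector<double> times) {
    std::sort(times.begin(), times.end());
    auto quantile = [&](double q) {
        auto index = static_cast<std::size_t>(q * (times.size() - 1));
        return times[index];
    };
    auto mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
    std::printf("%s: [mean=%.0f, q50=%.0f, q90=%.0f, q99=%.0f]\n",
                label, mean, quantile(0.5), quantile(0.9), quantile(0.99));
}
```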
Though this is compounded by the fact that we can define multiple algorithms. I believe this was the reason we initially printed JSON lines for the statistics, so that you could capture and parse them later. But this was before we had the `--extract` option, and we never really thought about it much after.
I believe that anything going to stderr shouldn't really be meant for parsing. This is because we also print other unstructured logs, such as "warming up" or "performing queries". These should be logs that tell the user what is happening; the data should go to stdout.
If we want to print summaries to stderr, I think we should either always print them or have two options: (1) print them, (2) don't print them (with one being the default and the other enabled with a flag). Having an additional filter for median, mean, minimum, etc. is just distracting. All it does is save a few lines of logs, but it adds significantly more complexity.
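A sketch of the second variant, assuming CLI11-style option parsing (the flag name is hypothetical):

```cpp
#include <CLI/CLI.hpp>

int main(int argc, char** argv) {
    CLI::App app{"queries"};
    bool no_summary = false;  // default: print all summaries to stderr
    app.add_flag("--no-summary", no_summary, "Do not print the summary to stderr");
    CLI11_PARSE(app, argc, argv);
    // ... run the benchmark, write per-query data to stdout ...
    if (!no_summary) {
        // print all summaries to stderr
    }
    return 0;
}
```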
Hi guys, sorry, I've been a bit absent.
If we don't want to use stderr, then we should bring back the `--extract` parameter, to be used as follows: `--extract extracted.csv`. In this case, the JSON will always be printed to stdout, so it can easily be redirected to a file, while other outputs (for example, logs) would continue to be printed to stderr, as they are now. What do you think?
On the other hand, I agree with the idea that the filters are unnecessary. We can simply output everything together, and let users filter using an external tool. This keeps the script simple, as @elshize mentioned.