elasticsearch
Support inline stats
We'd like the ability to enrich rows with the results of a `STATS` command. For example, say you have this:
a | b |
---|---|
1 | 10 |
1 | 20 |
2 | 20 |
2 | 15 |
Running `INLINESTATS MIN(b) BY a` should produce:
a | b | MIN(b) |
---|---|---|
1 | 10 | 10 |
1 | 20 | 10 |
2 | 20 | 15 |
2 | 15 | 15 |
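To make the intended semantics concrete, here is a minimal Python sketch of the behavior described above. It is only an illustration, not how ES|QL implements anything: the function `inline_stats` and its parameters are hypothetical names. It computes one aggregate per group, then attaches that value to every row of the group without collapsing the rows.

```python
from collections import defaultdict


def inline_stats(rows, agg, value_key, group_key, out_key):
    """Attach agg(value_key) per group_key to every row (hypothetical helper)."""
    # Collect the values of each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    # One aggregate value per group, as STATS would return.
    per_group = {group: agg(values) for group, values in groups.items()}
    # Unlike STATS, keep every input row and add the aggregate as a new column.
    return [{**row, out_key: per_group[row[group_key]]} for row in rows]


rows = [
    {"a": 1, "b": 10},
    {"a": 1, "b": 20},
    {"a": 2, "b": 20},
    {"a": 2, "b": 15},
]
enriched = inline_stats(rows, min, "b", "a", "MIN(b)")
for row in enriched:
    print(row)
```

The row count and row order are unchanged; only the extra `MIN(b)` column is added.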
Pinging @elastic/es-analytical-engine (Team:Analytics)
We're planning to implement this by splitting the query into two commands internally, then running them in a row. So, this:
FROM foo
| INLINESTATS MIN(b) BY a
| WHERE b == MIN(b)
becomes these two:
FROM foo
| STATS MIN(b) BY a
----and----
FROM foo
| LOOKUP _inlinestatsresults_ ON a
| WHERE b == MIN(b)
That `LOOKUP` command is being implemented here: #107987. It's my first step to getting this in.
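The two-query split above can be sketched in Python as two separate phases. This is a rough model under stated assumptions: `stats_min` and `lookup` are hypothetical stand-ins for the `STATS` and `LOOKUP` commands, and `foo` is made-up sample data matching the table in the issue description.

```python
def stats_min(rows, value_key, group_key):
    """Phase 1, modeling: FROM foo | STATS MIN(value_key) BY group_key."""
    result = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        if group not in result or value < result[group]:
            result[group] = value
    return result


def lookup(rows, table, group_key, out_key):
    """Phase 2, modeling: LOOKUP <phase 1 result> ON group_key."""
    return [{**row, out_key: table[row[group_key]]} for row in rows]


foo = [
    {"a": 1, "b": 10},
    {"a": 1, "b": 20},
    {"a": 2, "b": 20},
    {"a": 2, "b": 15},
]

# First query: small per-group result that gets shipped to the second query.
phase_one = stats_min(foo, "b", "a")

# Second query: re-read foo, join on a, then apply WHERE b == MIN(b).
joined = lookup(foo, phase_one, "a", "MIN(b)")
matches = [row for row in joined if row["b"] == row["MIN(b)"]]
print(matches)
```

Note that `foo` is read twice, once per phase, and only `phase_one` has to travel between the two queries.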
The main limitation of this approach is that the result of the first `STATS` has to be small enough to push across the wire. That's going to be true in plenty of cases, but at some point we'll need to deal with results that don't fit. Those cases are harder and probably all need some ability to spill to disk.
Little update as we're closing in on doing this: we plan to limit queries to a single `INLINESTATS` for now.
We plan to have just the single approach that phases the results. This approach is best when processing lots of data: you can compare individual docs to the `AVG` of some values of the bucket they fall into. That's good. But if, say, you have `STATS | INLINESTATS` then this query plan is silly: the `STATS` results have to be buffered already, so we should read from `STATS` twice. But we have to start somewhere, and this case seems less important than the case where you start with an `INLINESTATS`.
There's also a limit on the amount of space the results of the first phase can take up, at least in the proposal we're starting with. There will always be some limit, but for now the limit is 1MB. We'd like to lift that limit, but it's similar to the limit on the space in the `TABLES` parameter.
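As a rough illustration of what a first-phase size cap means in practice, here is a small Python sketch. Everything in it is an assumption for illustration: the names `MAX_PHASE_ONE_BYTES`, `PhaseOneTooLarge`, and `check_phase_one_size` are hypothetical, 1MB is read as 1024 * 1024 bytes, and JSON-encoded length stands in for however the real engine accounts for memory.

```python
import json

# Assumption: interpret the 1MB limit as 1024 * 1024 bytes.
MAX_PHASE_ONE_BYTES = 1024 * 1024


class PhaseOneTooLarge(Exception):
    """Raised when the first-phase result exceeds the size limit."""


def check_phase_one_size(result, limit=MAX_PHASE_ONE_BYTES):
    """Return the encoded size of result, or raise if it is over the limit."""
    # Assumption: JSON-encoded length is only a proxy for the real accounting.
    size = len(json.dumps(result).encode("utf-8"))
    if size > limit:
        raise PhaseOneTooLarge(f"first phase produced {size} bytes, limit is {limit}")
    return size


# A handful of groups fits easily.
small = [{"a": 1, "MIN(b)": 10}, {"a": 2, "MIN(b)": 15}]
print(check_phase_one_size(small))

# A high-cardinality grouping key can blow past the limit.
big = [{"a": i, "MIN(b)": i} for i in range(100_000)]
try:
    check_phase_one_size(big)
except PhaseOneTooLarge as err:
    print("rejected:", err)
```

The point of the sketch is that the limit is driven by the number of distinct groups in the first phase, not by the number of input rows.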