Support inline stats

Open nik9000 opened this issue 10 months ago • 3 comments

We'd like the ability to enrich rows with the results of a STATS command. For example, say you have this:

a b
1 10
1 20
2 20
2 15

Running INLINESTATS MIN(b) BY a should produce:

a b MIN(b)
1 10 10
1 20 10
2 20 15
2 15 15
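As a rough illustration of the semantics requested above (not how ES|QL implements it), here is a minimal Python sketch. The helper name `inline_stats_min` and the list-of-dicts row representation are assumptions made for the example only.

```python
def inline_stats_min(rows, value_key, group_key):
    """Append MIN(value_key) per group_key to every row, keeping all rows."""
    # First pass: compute the minimum of value_key for each group.
    mins = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        mins[group] = min(value, mins.get(group, value))
    # Second pass: enrich each original row with its group's minimum.
    out_col = f"MIN({value_key})"
    return [{**row, out_col: mins[row[group_key]]} for row in rows]


rows = [
    {"a": 1, "b": 10},
    {"a": 1, "b": 20},
    {"a": 2, "b": 20},
    {"a": 2, "b": 15},
]
enriched = inline_stats_min(rows, "b", "a")
for row in enriched:
    print(row)
```

Unlike a plain STATS, the row count does not change: every input row survives and gains one extra column.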

nik9000, Apr 17 '24 20:04

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine, Apr 17 '24 20:04

We're planning to implement this by splitting the query into two queries internally, then running them one after the other. So, this:

FROM foo
| INLINESTATS MIN(b) BY a
| WHERE b == MIN(b)

becomes these two:

FROM foo
| STATS MIN(b) BY a

----and----
FROM foo
| LOOKUP _inlinestatsresults_ ON a
| WHERE b == MIN(b)

That LOOKUP command is being implemented here: #107987. It's my first step toward getting this in.
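To make the two-phase plan concrete, here is a small Python simulation of it. This is only a sketch under assumptions: `stats_min_by` and `lookup` are hypothetical helpers standing in for the STATS and LOOKUP commands, and rows are modeled as plain dicts rather than anything the engine actually uses.

```python
def stats_min_by(rows, value_key, group_key):
    """Phase 1, standing in for: FROM foo | STATS MIN(b) BY a"""
    mins = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        mins[group] = min(value, mins.get(group, value))
    return [{group_key: g, f"MIN({value_key})": m} for g, m in mins.items()]


def lookup(rows, table, on):
    """Phase 2, standing in for: LOOKUP <phase-1 results> ON a"""
    index = {entry[on]: entry for entry in table}
    # Rows without a match are kept unchanged in this sketch.
    return [{**row, **index.get(row[on], {})} for row in rows]


foo = [
    {"a": 1, "b": 10},
    {"a": 1, "b": 20},
    {"a": 2, "b": 20},
    {"a": 2, "b": 15},
]

# Phase 1 yields one small row per group.
phase_one = stats_min_by(foo, "b", "a")

# Phase 2 re-reads the source, joins the phase-1 rows back on, then
# applies the trailing WHERE b == MIN(b).
phase_two = [
    row for row in lookup(foo, phase_one, "a")
    if row["b"] == row.get("MIN(b)")
]
print(phase_two)
```

The source is read twice here, once per phase, which mirrors the plan described above.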

nik9000, May 01 '24 16:05

The main limitation of this approach is that the result of the first STATS has to be small enough to push across the wire. That's going to be true in plenty of cases, but at some point we'll need to deal with results that don't fit. Those cases are harder, and they probably all want some ability to spill to disk.

nik9000, May 01 '24 16:05

Little update as we're closing in on doing this: we plan to limit queries to a single INLINESTATS for now.

We plan to have just a single approach: the one that runs the query in phases. This approach is the best when processing lots of data - you can compare individual docs to the AVG of some value across the bucket they fall into. That's good. But if, say, you have STATS | INLINESTATS, then this query plan is silly - the STATS results have to be buffered already, so we should read from STATS twice instead. But we have to start somewhere, and this case seems less important than the case where you start with an INLINESTATS.

nik9000, Jul 08 '24 12:07

There's also a limit on the amount of space the results of the first phase can take up - at least in the proposal we're starting with. There will always be some limit, but for now the limit is 1MB. We'd like to lift that limit, but it's similar to the limit on the space in the TABLES parameter.
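For illustration only, a size guard on the first-phase results might look like the Python sketch below. The function name `check_phase_one_size` is hypothetical, and measuring the size as JSON-encoded bytes is an assumption made for the example; the engine's own accounting is not described here and likely differs.

```python
import json

# 1MB, the limit mentioned above. The constant name is made up for this sketch.
MAX_PHASE_ONE_BYTES = 1024 * 1024


def check_phase_one_size(results, limit=MAX_PHASE_ONE_BYTES):
    """Return the estimated size of the results, or raise if over the limit."""
    # Assumption: JSON-encoded length is used as a stand-in size estimate.
    size = len(json.dumps(results).encode("utf-8"))
    if size > limit:
        raise ValueError(
            f"first-phase results are {size} bytes, over the {limit} byte limit"
        )
    return size


small = [{"a": 1, "MIN(b)": 10}, {"a": 2, "MIN(b)": 15}]
print(check_phase_one_size(small))
```

A result set with many distinct groups would trip the guard, which is the case the thread says will eventually need spilling to disk.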

nik9000, Jul 08 '24 13:07