ESQL: Implement LOOKUP, an "inline" enrich
This adds support for `LOOKUP`, a command that implements a sort of inline `ENRICH`, using data that is passed in the request:
```sh
$ curl -uelastic:password -HContent-Type:application/json -XPOST \
  'localhost:9200/_query?error_trace&pretty&format=txt' \
  -d'{
  "query": "ROW a=1::LONG | LOOKUP t ON a",
  "tables": {
    "t": {
      "a:long": [ 1, 4, 2],
      "v1:integer": [ 10, 11, 12],
      "v2:keyword": ["cat", "dog", "wow"]
    }
  },
  "version": "2024.04.01"
}'
```
```text
      v1       |      v2       |       a
---------------+---------------+---------------
10             |cat            |1
```
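(Only one row comes back because `ROW a=1::LONG` produces a single input row; its `a=1` matches the first entry in `t`, which is what pulls in `v1=10` and `v2=cat`.)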
This required these PRs:
- #107624
- #107634
- #107701
- #107762
- #107923
- #107894
- #107982
- #108012
- #108020
- #108169
- #108191
- #108334
- #108482
- #108696
Closes #107306
Hi @nik9000, I've created a changelog YAML for you.
I've opened this as a draft because I haven't added any real tests. And because stuff like:
```sh
curl -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?error_trace&pretty' -d'{
  "query": "ROW a=1::LONG | LOOKUP t1 ON a | LOOKUP t2 ON v1",
  "tables": {
    "t1": {
      "a:long": [1, 4, 2],
      "v1:long": [5, 8, 9]
    },
    "t2": {
      "v1:long": [ 5, 8, 9],
      "v2:keyword": ["cat", "dog", "wow"]
    }
  },
  "version": "2024.04.01"
}'
```
fails with array index out of bounds exceptions that don't make sense to me yet.
Hi @nik9000, I've updated the changelog YAML for you.
So I ran a little test:
```sh
curl -uelastic:password -XDELETE localhost:9200/test

# Build a bulk body with 1,000 copies of three docs (a=1, a=4, a=2)
# and send it 100 times - 300,000 documents in total.
for a in {0..99}; do
  rm -f /tmp/evil
  for b in {0..999}; do
    echo '{"index": {}}' >> /tmp/evil
    echo '{"a": 1}' >> /tmp/evil
    echo '{"index": {}}' >> /tmp/evil
    echo '{"a": 4}' >> /tmp/evil
    echo '{"index": {}}' >> /tmp/evil
    echo '{"a": 2}' >> /tmp/evil
  done
  echo >> /tmp/evil
  echo -n "$a: "
  curl -s -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/test/_bulk?pretty --data-binary @/tmp/evil | grep \"errors\"
done

curl -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/test/_forcemerge?max_num_segments=1
curl -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/test/_refresh
```
Then I ran these two a bunch:
```sh
curl -uelastic:password -HContent-Type:application/json -XPOST \
  'localhost:9200/_query?error_trace&pretty' \
  -d'{
  "query": "FROM test | EVAL v1=a+1 | STATS COUNT(v1)",
  "profile": true,
  "version": "2024.04.01"
}'
```
```sh
curl -uelastic:password -HContent-Type:application/json -XPOST \
  'localhost:9200/_query?error_trace&pretty' \
  -d'{
  "query": "FROM test | LOOKUP t ON a | STATS COUNT(v1)",
  "profile": true,
  "tables": {
    "t": {
      "a:long": [ 1, 4, 2],
      "v1:integer": [ 10, 11, 12]
    }
  },
  "version": "2024.04.01"
}'
```
That's 300,000 documents - not many, but enough to test quickly on my laptop. From the profile, `EVAL v1=a + 1` took 1ms, or about 3 nanoseconds per value. The actual addition is likely less than .3ns per operation, but I guess the block accounting slows us down. `LOOKUP t ON a` took 10ms, or about 33ns per value. That's not scientific, but it's believable. About 28ns of those 33ns is hashing - which isn't a huge surprise - we know we're using our slowest hash implementation. We have faster hash implementations. We just have to flip a few things around to plug them in properly.
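Spelling out the arithmetic: 1ms / 300,000 values ≈ 3.3ns per value for the `EVAL`, and 10ms / 300,000 ≈ 33ns per value for the `LOOKUP`. With ~28ns of that going to hashing, the rest of the join machinery is only costing around 5ns per value.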
My last push seems to have fixed double-layered lookup:
```sh
$ curl -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?error_trace&pretty' -d'{
  "query": "ROW a=1::LONG | LOOKUP t1 ON a | LOOKUP t2 ON v1",
  "tables": {
    "t1": {
      "a:long": [1, 4, 2],
      "v1:long": [5, 8, 9]
    },
    "t2": {
      "v1:long": [ 5, 8, 9],
      "v2:keyword": ["cat", "dog", "wow"]
    }
  },
  "version": "2024.04.01"
}'
```
```json
{
  "columns" : [
    {
      "name" : "a",
      "type" : "long"
    },
    {
      "name" : "v1",
      "type" : "long"
    },
    {
      "name" : "v2",
      "type" : "keyword"
    }
  ],
  "values" : [
    [
      1,
      5,
      "cat"
    ]
  ]
}
```
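That traces correctly: the input row's `a=1` matches the first entry of `t1`, picking up `v1=5`, and `v1=5` in turn matches the first entry of `t2`, picking up `v2="cat"`.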
This needs a bunch more tests and docs.
This is a feature that'll likely stay experimental for a few versions regardless of how difficult it is for us to settle the external-facing stuff.
I'm getting failures in `NodeSubclassTests`. It'd be nice to have a little help on that one. I'm not sure what it's doing.
Costin's going to juggle the language parts of this a fair bit.
Costin pushed a change to this branch to model the `LOOKUP` as a full blown `Join` that we then identify as something we can implement as a `HashJoinExec`. I've pushed a few PRs that smoothed out a bunch of fun edges.
The interesting thing is that `Join` is our first `BinaryOperator` - it has two child operators - the rest of the ESQL query on one side and the table provided in the parameter on the other. Everything else in ESQL is `Unary` - the `Driver` itself is fundamentally a unary entity at the moment.
We solve this by making `HashJoinExec` unary - just like every other hash join ever implemented, it prepares the hash table up front and then streams values through it. That "preparation" is just something we do at construction time. That's kind of a hack, but it works and it lets us build things. We can replace that with some kind of actual binary-operation `Driver` later.
I haven't resolved serialization - it's important that the `LocalRelation`s that this uses only be serialized to the data nodes:
- If they are used
- One time
Right now I just blow up if you try to serialize this. Which is a shame, because one of the nice parts of `HashJoinExec` is that you can execute it on the data node or the coordinating node. That works, but I have to fix serialization.
There are lots of optimizations that don't work super well and will need some followup work. In particular, it looks like we can't "project" "through" a `Join` and down into the plans below it. That's solvable, but we don't have to do it to get this prototype in.
Progress report:
Serialization works now.
My biggest open question is this trick I do where I rework `FieldAttribute`s from the local relation into `ReferenceAttribute`s. This prevents looking up those attributes in `SearchStats`. I think the right solution to this is to modify how we use `SearchStats` to look up information based on where the `FieldAttribute` comes from. Or maybe put a reference to some `Stats` in the `FieldAttribute` itself. But, for now, converting to a `ReferenceAttribute` seems to solve the problem. I'd love to hear more from experts.
I'm updating tests and removing NOCOMMITs today and might finish. Maybe!
Another open question - we put LOOKUP columns before the incoming data (in the first example above the output is `v1 | v2 | a`, with the incoming `a` last). I think maybe we should flip that.
Pinging @elastic/es-analytical-engine (Team:Analytics)
One more thing I noticed: we don't seem to validate that datatypes match between the `ON` column and the corresponding column from the table:
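For instance - a hypothetical reproduction sketched from the description, not the reporter's exact request - something like this declares the table-side `a` as `keyword` while the query-side `a` is a `long`, and there's no up-front rejection:
```sh
# Hypothetical reproduction: the ON column `a` is a LONG in the query
# but declared as keyword in the table; no validation error comes back.
curl -uelastic:password -HContent-Type:application/json -XPOST \
  'localhost:9200/_query?error_trace&pretty' \
  -d'{
  "query": "ROW a=1::LONG | LOOKUP t ON a",
  "tables": {
    "t": {
      "a:keyword": ["1", "4", "2"],
      "v1:integer": [10, 11, 12]
    }
  },
  "version": "2024.04.01"
}'
```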
Oh neat. Yeah, that's bad.
Had a discussion with Alex - we'll revisit the QL side once this PR lands in. There are several aspects that could be improved such as:
- keeping the join fields as an expression (even if it's just `Equals`) instead of separate fields, since this helps the planner
- adding the missing `USING` back in the join type (it seems it was removed)
- looking at the shadowing mechanism in joins
OK! I was distracted last week, but this week I'm going to try to land this one. So! First thing is to merge main into it and get the tests passing. Next I need to corral all of the line comments into a big checklist. There are a lot of line comments where we've said "nope, we're not doing this" and a lot we want to do, and I've lost track. So I need to make a list.
Pinging @elastic/kibana-esql (ES|QL-ui)
TODOS:
- [x] https://github.com/elastic/elasticsearch/pull/107987#issuecomment-2118394344
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1587059151
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601562300
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601583844
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601564991
- [ ] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601539235
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601531116
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601491776
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600263894
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600260873
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600255564
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600137357
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601510974
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600218810
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600198174
- [x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600166496
- [x] https://github.com/elastic/elasticsearch/pull/107987#pullrequestreview-2059782890
I'd like to hold https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601539235 until a followup, I think. Otherwise, I believe this is ready for review. I'll need to double check it myself, but I believe I've knocked out the comments I could find.
It occurred to me this morning that I'm going to need a new capability and probably a transport version. That'll be a today thing. And finding tests for it.
cc @astefan and @bpintea for awareness
OK! I've pushed the last update from a comment. I think we've extracted all of the unfinished comments into our lovely followup meta issue (https://github.com/elastic/elasticsearch/issues/109353). I'm marking this `auto-merge` and beginning to land it. Thanks so much, friends. 186 comments, 96 commits, over a little more than a month. And 16 pre-PRs that had to land before that. And followups. This is a big one!
Oh darn. I thought the bwc failures were because of a release, but that doesn't make sense - we didn't have a branch cut. They were real. I fixed them and pushed.