elasticsearch ESQL: Implement LOOKUP, an "inline" enrich

This adds support for LOOKUP, a command that implements a sort of inline ENRICH, using data that is passed in the request:

$ curl -uelastic:password -HContent-Type:application/json -XPOST \
    'localhost:9200/_query?error_trace&pretty&format=txt' \
-d'{
    "query": "ROW a=1::LONG | LOOKUP t ON a",
    "tables": {
        "t": {
            "a:long":     [    1,     4,     2],
            "v1:integer": [   10,    11,    12],
            "v2:keyword": ["cat", "dog", "wow"]
        }
    },
    "version": "2024.04.01"
}'
      v1       |      v2       |       a       
---------------+---------------+---------------
10             |cat            |1

This required these PRs:

#107624
#107634
#107701
#107762
#107923
#107894
#107982
#108012
#108020
#108169
#108191
#108334
#108482
#108696

Closes #107306

Apr 28 '24 11:04 nik9000

Hi @nik9000, I've created a changelog YAML for you.

Apr 28 '24 11:04 elasticsearchmachine

I've opened this as draft because I've not added any real tests. And because stuff like:

curl -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?error_trace&pretty' -d'{
    "query": "ROW a=1::LONG | LOOKUP t1 ON a | LOOKUP t2 ON v1",
    "tables": {
        "t1": {
            "a:long":     [1, 4, 2],
            "v1:long":    [5, 8, 9]
        },
        "t2": {
            "v1:long":    [    5,     8,     9],
            "v2:keyword": ["cat", "dog", "wow"]
        }
    },
    "version": "2024.04.01"
}'

fails with array index out of bounds exceptions that don't make sense to me yet.

Apr 28 '24 11:04 nik9000

Hi @nik9000, I've updated the changelog YAML for you.

Apr 28 '24 12:04 elasticsearchmachine

So I ran a little test:

curl -uelastic:password -XDELETE localhost:9200/test
for a in {0..99}; do
   rm -f /tmp/evil
   for b in {0..999}; do
      echo '{"index": {}}' >> /tmp/evil
      echo '{"a": 1}' >> /tmp/evil
      echo '{"index": {}}' >> /tmp/evil
      echo '{"a": 4}' >> /tmp/evil
      echo '{"index": {}}' >> /tmp/evil
      echo '{"a": 2}' >> /tmp/evil
   done
   echo >> /tmp/evil
   echo -n "$a:  "
   curl -s -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/test/_bulk?pretty --data-binary @/tmp/evil | grep \"errors\"
done

curl -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/test/_forcemerge?max_num_segments=1
curl -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/test/_refresh

Then I ran these two a bunch:

curl -uelastic:password -HContent-Type:application/json -XPOST \
    'localhost:9200/_query?error_trace&pretty' \
-d'{
    "query": "FROM test | EVAL v1=a+1 | STATS COUNT(v1)",
    "profile": true,
    "version": "2024.04.01"
}'
curl -uelastic:password -HContent-Type:application/json -XPOST \
    'localhost:9200/_query?error_trace&pretty' \
-d'{
    "query": "FROM test | LOOKUP t ON a | STATS COUNT(v1)",
    "profile": true,
    "tables": {
        "t": {
            "a:long":     [    1,     4,     2],
            "v1:integer": [   10,    11,    12]
        }
    },
    "version": "2024.04.01"
}'

That's 300,000 documents - not many, but small enough to test fast on my laptop. From the profile EVAL v1=a + 1 took 1ms or about 3 nanoseconds per value. The actual addition is likely less than .3ns per operation but I guess the block accounting gets us slower. LOOKUP t ON a took 10ms or about 33ns per value. That's not scientific, but it's believable. About 28ns of those 33ns is hashing - which isn't a huge surprise - we know we're using our slowest hash implementation. We have faster hash implementations. We just have to flip a few things around to plug them in properly.

Apr 28 '24 19:04 nik9000

My last push seems to have fixed double-layered lookup:

$ curl -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?error_trace&pretty' -d'{
    "query": "ROW a=1::LONG | LOOKUP t1 ON a | LOOKUP t2 ON v1",
    "tables": {
        "t1": {
            "a:long":     [1, 4, 2],
            "v1:long":    [5, 8, 9]
        },
        "t2": {
            "v1:long":    [    5,     8,     9],
            "v2:keyword": ["cat", "dog", "wow"]
        }
    },
    "version": "2024.04.01"
}'
{
  "columns" : [
    {
      "name" : "a",
      "type" : "long"
    },
    {
      "name" : "v1",
      "type" : "long"
    },
    {
      "name" : "v2",
      "type" : "keyword"
    }
  ],
  "values" : [
    [
      1,
      5,
      "cat"
    ]
  ]
}

Apr 28 '24 19:04 nik9000

This needs a bunch more tests and docs.

This is a feature that'll likely stay in experimental for a few versions regardless of how difficult it is for us to settle the external facing stuff.

Apr 29 '24 15:04 nik9000

I'm getting failures in NodeSubclassTests. It'd be nice to have a little help on that one. I'm not sure what it's doing.

Apr 29 '24 19:04 nik9000

Costin's going to juggle the the language parts of this a fair bit.

May 01 '24 16:05 nik9000

Costin pushed a change to this branch to model the LOOKUP as a full blown Join that we then identify as something we can implement as a HashJoinExec. I've pushed a few PRs that smoothed out a bunch of fun edges.

The interesting thing is that Join is our first BinaryOperator - it has two child operators - the rest of the ESQL query on one side and the table provided in the parameter on the other. Everything else in ESQL is Unary - the Driver itself is fundamentally a unary entity at the moment.

We solve this by making HashJoinExec unary - just like every other hash join ever implemented it prepares the hash table up front and then streams values through it. That "preparation" is just something we do at construction time. That's kind of a hack, but it works and it let's us build things. We can replace that with some kind of actual binary-operation Driver later.

I haven't resolved serialization - it's important that the LocalRelations that this uses only be serialized to the data nodes:

If they are used
One time

Right now I just blow up if you try to serialize this. Which is a shame because one of the nice parts of HashJoinExec is that you can execute it on the data node or coordinating node. That works, but I have to fix serialization.

There are lots of optimizations that don't work super well and will need some followup work. In particular, it looks like we can't "project" "through" a Join and down into the plans below it. That's solveable but we don't have to do it to get this prototype in.

May 06 '24 13:05 nik9000

Progress report:

Serialization works now.

My biggest open question is this trick I do where I rework FieldAttributes from the local relation into ReferenceAttributes. This prevents looking up those attributes in SearchStats. I think the right solution to this is to modify how we use SearchStats to lookup information based on where the FieldAttribute comes from. Or maybe put a reference to some Stats in the FieldAttribute itself. But, for now, converting to a ReferenceAttribute seems to solve the problem. I'd love to hear more from experts

I'm updating tests and removing NOCOMMITs today and might finish. Maybe!

May 08 '24 13:05 nik9000

Another open question - we puts LOOKUP columns before the incoming data. I think maybe we should flip that.

May 08 '24 13:05 nik9000

Pinging @elastic/es-analytical-engine (Team:Analytics)

May 08 '24 19:05 elasticsearchmachine

One more thing I noticed: we don't seem to validate that datatypes match between the ON column and the corresponding column from the table:

Oh neat. Yeah, that's bad.

May 17 '24 16:05 nik9000

Had a discussion with Alex - we'll revisit the QL side once this PR lands in. There are several aspects that could be improved such as:

keeping the join fields as an expression (even if it's just Equals) instead of separate fields since this helps the planner
adding the missing USING back in the join type (seems it was removed)
look at the shadowing mechanism in joins

May 17 '24 21:05 costin

OK! I was distracted last week. But this week I'm going to try to land this one. So! First thing is to merge main into it and get the tests passing. Next I need to corral all of the line comments into a big check list. There are a lot of line comments we've said "nope, we're not doing this" and there a lot we want to do and I've lost track. So I need to make a list.

Jun 03 '24 15:06 nik9000

Pinging @elastic/kibana-esql (ES|QL-ui)

Jun 03 '24 16:06 elasticsearchmachine

TODOS:

[x] https://github.com/elastic/elasticsearch/pull/107987#issuecomment-2118394344
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1587059151
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601562300
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601583844
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601564991
[ ] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601539235
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601531116
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601491776
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600263894
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600260873
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600255564
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600137357
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601510974
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600218810
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600198174
[x] https://github.com/elastic/elasticsearch/pull/107987#discussion_r1600166496
[x] https://github.com/elastic/elasticsearch/pull/107987#pullrequestreview-2059782890

Jun 03 '24 17:06 nik9000

I'd like to hold https://github.com/elastic/elasticsearch/pull/107987#discussion_r1601539235 until a followup, I think. Otherwise, I believe this is ready for review. I'll need to double check it myself, but I believe I've knocked out the comments I could find.

Jun 03 '24 22:06 nik9000

It occurred to me this morning that I'm going to need a new capability and probably a transport version. That'll be a today thing. And finding tests for it.

Jun 04 '24 10:06 nik9000

cc @astefan and @bpintea for awareness

Jun 05 '24 05:06 costin

OK! I've pushed the last update from a comment. I think we've extracted all of the unfinished comments into our lovely followup meta issue (https://github.com/elastic/elasticsearch/issues/109353). I'm marking this auto-merge and beginning to land it. Thanks so much friends. 186 comments, 96 commits. Over a little more than a month. And 16 pre-PRs that had to land before that. And followups. This is a big one!

Jun 05 '24 13:06 nik9000

Oh darn. I thought the bwc failures were because of a release, but that doesn't make sense - we didn't have a branch cut. they were real. I fixed them and pushed.

Jun 07 '24 00:06 nik9000

elasticsearch elasticsearch copied to clipboard

ESQL: Implement LOOKUP, an "inline" enrich

elasticsearch
elasticsearch copied to clipboard