ES|QL: support name qualifiers
Description
- [ ] revisit https://github.com/elastic/elasticsearch/pull/110581 which removed the original implementation for qualifiers
- [ ] implement qualifiers for LOOKUP https://github.com/elastic/elasticsearch/issues/112014
- [ ] investigate and expand the name qualifiers to ENRICH https://github.com/elastic/elasticsearch/issues/112015
LOOKUP command is a powerful tool that helps users combine static tables of values with "live" data from Elasticsearch. It is consistent with the rest of the language features when it comes to same-name columns in that "the last column" wins.
For example (notice salary column that is present in employees index and salary column that is present in the static table benefits)
{
"query": "FROM employees | WHERE languages IS NOT NULL | SORT salary DESC | KEEP first_name, last_name, languages, salary | LOOKUP benefits ON languages | LIMIT 5",
"tables": {
"benefits": {
"languages": {"integer": [1, 2, 3, 4, 5]},
"salary": {"integer": [100,200,300,400,500]},
"bonus_percent":{"double": [5,7,9,11,15]}
}
}
}
will result in
first_name | last_name | languages | salary | bonus_percent
---------------+---------------+---------------+---------------+---------------
Moss |Shanbhogue |3 |300 |9.0
Tzvetan |Zielinski |4 |400 |11.0
Lillian |Haddadi |1 |100 |5.0
Valter |Sullins |2 |200 |7.0
Remzi |Waschkowski |3 |300 |9.0
salary column from employees is being replaced by the salary column from the benefits table.
One can argue, though, that both columns are useful and must be kept. There is an workaround here and requires few changes to the query ("manually" copying the old values in a new column):
{
"query": "FROM employees | WHERE languages IS NOT NULL | EVAL old_salary = salary | SORT old_salary DESC | KEEP first_name, last_name, languages, salary, old_salary | LOOKUP benefits ON languages | LIMIT 5",
"tables": {
"benefits": {
"languages": {"integer": [1, 2, 3, 4, 5]},
"salary": {"integer": [100,200,300,400,500]},
"bonus_percent":{"double": [5,7,9,11,15]}
}
}
}
which results in
first_name | last_name | languages | old_salary | salary | bonus_percent
---------------+---------------+---------------+---------------+---------------+---------------
Moss |Shanbhogue |3 |74970 |300 |9.0
Tzvetan |Zielinski |4 |74572 |400 |11.0
Lillian |Haddadi |1 |73717 |100 |5.0
Valter |Sullins |2 |73578 |200 |7.0
Remzi |Waschkowski |3 |71165 |300 |9.0
But we can do better and have these additional steps be added automatically in the form of name qualifiers:
{
"query": "FROM employees | WHERE languages IS NOT NULL | SORT salary DESC | KEEP first_name, last_name, languages, salary | LOOKUP benefits ON languages | LIMIT 5",
"tables": {
"benefits": {
"languages": {"integer": [1, 2, 3, 4, 5]},
"salary": {"integer": [100,200,300,400,500]},
"bonus_percent":{"double": [5,7,9,11,15]}
}
}
}
resulting in
first_name | last_name | languages | salary |employees.salary|benefits.salary| bonus_percent
---------------+---------------+---------------+---------------+----------------+---------------+---------------
Moss |Shanbhogue |3 |74970 |74970 |300 |9.0
Tzvetan |Zielinski |4 |74572 |74572 |400 |11.0
Lillian |Haddadi |1 |73717 |73717 |100 |5.0
Valter |Sullins |2 |73578 |73578 |200 |7.0
Remzi |Waschkowski |3 |71165 |71165 |300 |9.0
Pinging @elastic/es-analytical-engine (Team:Analytics)
Users will eventually need a way to provide their own name (alias) for the qualifier, as the index names can be quite long (.ds-metrics-prometheus.collector-mki-ksm-crd-scraper-2024.08.21-000645 for example). You wouldn't want to have to type that when referring to a column. It's a another nice layer on top of what's proposed above but I think we should think about it now to avoid making it harder later.
What happens if the main index resolves to more than one index? For example, if logs-* resolves to both logs-1 and logs-2, what would the columns be in this case?
We have at least two more options here:
- explicitly choose (and in case rename) the lookup columns:
LOOKUP benefits ON languages WITH benefits.salary = salary, bonus_percent`
It's the same syntax we have in ENRICH, so in terms of consistency it seems desirable
- declare a general prefix for all the fields extracted from the lookup:
LOOKUP benefits ON languages PREFIX `benefits.`
I think it's important to point out that any implicit behavior (eg. an implicit prefix as the name of the lookup table) will be a breaking change and won't avoid collisions (eg. employees could have a field called benefits.salary)
LOOKUP benefits ON languages WITH benefits.salary = salary, bonus_percent`It's the same syntax we have in ENRICH, so in terms of consistency it seems desirable
That'd be consistent, but the problem with the WITH syntax is that it requires explicit aliasing of each and every field. That applies even more so to ENRICH: if the enrich index has many fields, you may accidentally overwrite your fields, or you'll have to list each and every field you want. Qualifiers are one way to solve this, by avoiding name conflicts altogether.
I'm picking this back up, but the world has changed a little since we initially wanted this.
Instead of heavily revising this issue, I'm going to go and create a new one so we don't lose history.
Superseded by https://github.com/elastic/elasticsearch/issues/133068