sql
sql copied to clipboard
[FEATURE] PPL LOOKUP Functionality
Is your feature request related to a problem? OpenSearch users want an easy way to enrich the data they have stored in OpenSearch and external data sources using content from an OpenSearch index. This is common in security analytics scenarios where one wants to enrich their IP reputation lists, vulnerability databases, or threat feeds.
What solution would you like? Do a lookup of a field/value, from another log group and use that to convert to user friendly name/error code.
- The lookup feature should support the OpenSearch indexes as data sources for lookup tables
- Users should be able to perform a static lookup using the OpenSearch index and external data sources based on the Spark integration
- Users should be able to perform a static lookup using a user generated CSV such as an org unit mapping or GeoIP from MaxMind and OpenSearch index
- Users should be able to perform a static lookup using a user generated CSV such as an org unit mapping or GeoIP from MaxMind and an external data source
- Admins can use Index State Management to control how long the reference lookup index is available
- Include helpful error messages
*** Out of Scope ***
- Defining new lookup data sources beyond listed types
- Automatic field mapping between events and lookups
What alternatives have you considered? Performing joins using SQL
Do you have any additional context? None.
PPL Lookup Design Proposal
As implemented in PR 2698 the proposed design (and so far implemented) syntax is:
Design
The lookup
command can be implemented as a simple search for documents in the lookup index. For every row in a search result, a search with the given match fields are performed. If a single document is found, the fields and values of the lookup document are copied to the current row of the search result. If no document is found a no-op is performed. If multiple documents are found, an error is thrown. The implementation is mainly done in core/src/main/java/org/opensearch/sql/analysis/Analyzer.java
and opensearch/src/main/java/org/opensearch/sql/opensearch/storage/OpenSearchIndex.java
. The "lookup search" is performed as a term
and a match
query. So both cases are catched: When a field is not analyzed
or analyzed
with respect to it mapping.
The Spark PPL Lookup command implementation is done in separate PR in the opensearch-spark repo: PR 407. Here we can (and need) to implement it as a join
.
Syntax
lookup <lookup index> <lookup field> [AS <local lookup field>] [<lookup field> [AS <local lookup field>]]… [appendonly=true|false] [<source field> [AS <local field>]]...
lookup
is the name of the lookup operation and it is supposed to be changed to something else. Normally we would use lookup
but this seems to already used otherwise in the AST.
<lookup index>
is the name of the lookup index (mandatory).
Then we need at least one <lookup field>
which is a field in the lookup index used to match to a local field (in the current search) to get the lookup document. When there is no lookup document we just do nothing, if there is more than one we fail with an error.
If more than one <lookup field>
is provided, all of them must match (we do a term and a match query for the field value as of now)
If the field has a different name in the current search result use <local lookup field>
to map it.
appendonly
is false
by default abnd inidicates if the values we copy over to the search result from the lookup documemnt should overwrite existing values. If appendonly
is true
we do not overwrite existing values.
<source field>
are the fields that should be copied. If no such fields are given all fields are copied. If the field should have a different name than in the lookup document use [AS <local field>
Examples:
{"query":"source=logins | lookup users uid AS id appendonly=true"}
{"query":"source=logins | lookup users uid,name phone,department AS thedepartment"}
+1 for this feature
PPL Lookup Command Design
As implemented in https://github.com/opensearch-project/opensearch-spark/pull/686 the proposed design doc:
Overview
Lookup command enriches your search data by adding or replacing data from a lookup index (dimension table). You can extend fields of an index with values from a dimension table, append or replace values when lookup condition is matched. As an alternative of Join command, lookup command is more suitable for enriching the source data with a static dataset.
Syntax of Lookup Command
SEARCH source=<sourceIndex>
| <other piped command>
| LOOKUP <lookupIndex> (<lookupMappingField> [AS <sourceMappingField>])...
[(REPLACE | APPEND) (<inputField> [AS <outputField>])...]
| <other piped command>
lookupIndex
- Required
- Description: the name of lookup index (dimension table)
lookupMappingField
- Required
- Description: A mapping key in <lookupIndex>, analogy to a join key from right table. You can specify multiple <lookupMappingField> with comma-delimited.
sourceMappingField
- Optional
- Default: <lookupMappingField>
- Description: A mapping key from source query, analogy to a join key from left side. If you don't specify any <sourceMappingField>, its default value is <lookupMappingField>.
inputField
- Optional
- Default: All fields of <lookupIndex> where matched values are applied to result output if no field is specified.
- Description: A field in <lookupIndex> where matched values are applied to result output. You can specify multiple <inputField> with comma-delimited. If you don't specify any <inputField>, all fields of <lookupIndex> where matched values are applied to result output.
outputField
- Optional
- Default: <inputField>
- Description: A field of output. You can specify multiple <outputField>. If you specify <outputField> with an existing field name in source query, its values will be replaced or appended by matched values from <inputField>. If the field specified in <outputField> is a new field, an extended new field will be applied to the results.
REPLACE | APPEND
- Optional
- Default: REPLACE
- Description: If you specify REPLACE, matched values in <lookupIndex> field overwrite the values in result. If you specify APPEND, matched values in <lookupIndex> field only append to the missing values in result.
Usage
-
source = table1 | lookup table2 id
-
source = table1 | lookup table2 id, name
-
source = table1 | lookup table2 id as cid, name
-
source = table1 | lookup table2 id as cid, name replace dept as department
-
source = table1 | lookup table2 id as cid, name replace dept as department, city as location
-
source = table1 | lookup table2 id as cid, name append dept as department
-
source = table1 | lookup table2 id as cid, name append dept as department, city as location
Examples
SEARCH source=<sourceIndex>
| WHERE orderType = 'Cancelled'
| LOOKUP account_list, mkt_id AS mkt_code REPLACE amount, account_name AS name
| STATS count(mkt_code), avg(amount) BY name
SEARCH source=<sourceIndex>
| DEDUP market_id
| EVAL category=replace(category, "-", ".")
| EVAL category=ltrim(category, "dvp.")
| LOOKUP bounce_category category AS category APPEND classification
SEARCH source=<sourceIndex>
| LOOKUP bounce_category category
@YANG-DB guess we can close this one because it seems done in https://github.com/opensearch-project/opensearch-spark/pull/686 ?
@YANG-DB guess we can close this one because it seems done in opensearch-project/opensearch-spark#686 ?
We are keeping this open since this issue refers to the PPL OpenSearch engine