Add flattened field type
(updated from https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805 below, @macrakis)
[Design Proposal] The flat data type in OpenSearch
Summary
This proposal adds a new field type to OpenSearch for JSON objects whose components are not indexed. We call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
Motivation
- Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
- Flat fields do not create a large number of fields, one per unique key. The "mapping explosion" caused by mapping many individual fields can lead to heavy RAM use and thrashing. (Memory efficiency)
- Flat fields do not have inverted indexes, which take space. (Space efficiency)
- Migration from systems supporting similar flat types is easy. (Compatibility)
Demand
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
- Prevent mapping explosion.
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
- Migrating corpus with existing flattened mappings.
Specification
Mapping and ingestion
- flattened is a new mapping type
- Fields declared flattened are ingested as structured, nested objects.
- Neither the field as a whole nor its subfields are indexed.
- The nested fields are uninterpreted keywords, and sort like strings. Nested fields cannot be used as dates or numbers.
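To illustrate what "sort like strings" means in practice, here is a small Python sketch. It is not OpenSearch code, and the sample values are made up for illustration; it only contrasts lexicographic ordering with the numeric ordering that flat subfields would not provide.

```python
# Hypothetical leaf values as they would be kept in a flat field:
# every value is an uninterpreted keyword, i.e. a plain string.
page_counts = ["9", "10", "200", "1000"]

# Textual (lexicographic) sort, the behavior described in the proposal.
textual = sorted(page_counts)

# Numeric sort, shown only for contrast; flat subfields would not offer it.
numeric = sorted(page_counts, key=int)

print(textual)
print(numeric)
```

If numeric ordering matters for a subfield, that subfield would presumably need to be mapped separately with a numeric type.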
Searching and retrieving
- Supports fetching subfields with the usual dotted notation.
- Supports aggregations of subfields with the usual dotted notation.
- Filtering by subfields is supported, but may be inefficient (full scan).
Example
This declares catalog as being of type flattened:
{ "mappings":
  { "properties":
    { "ISBN13": { "type": "keyword" },
      "catalog": { "type": "flattened" } }
  }}
Consider the ingestion of the following document:
{ "ISBN13" : "V9781933988177",
  "catalog" :
  { "title" : "Lucene in Action",
    "author1" :
    { "surname" : "McCandless",
      "given" : "Mike" }
  }}
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field, for example, may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
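The dot path access described here can be sketched in Python as a walk over a nested dictionary. This is only an illustration of the access semantics, not the storage format or the OpenSearch implementation; the helper name get_by_dot_path is hypothetical.

```python
from typing import Any


def get_by_dot_path(doc: dict, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# The example document from the proposal.
document = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "author1": {"surname": "McCandless", "given": "Mike"},
    },
}

print(get_by_dot_path(document, "catalog.author1.surname"))
print(get_by_dot_path(document, "catalog.author1.middle"))
```

The first call resolves to the stored surname; the second returns None because the document has no such subfield.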
Performance
- Performance should be similar to a keyword field.
- Fetching the value of a nested field using dot paths in a given document should be efficient.
- Finding a document with a specific value of a nested field (e.g., given = "Mike") is not efficient: it may require a full scan of the index, an expensive operation.
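The cost difference between the last two points can be illustrated with a toy Python model. It assumes an in-memory dictionary of documents, which is far simpler than a real Lucene index; the document IDs, sample names, and function names are all made up. A full scan touches every document per query, while an inverted index (which flat subfields do not have under this proposal) answers with a single lookup once it is built.

```python
from collections import defaultdict

# Hypothetical corpus keyed by document ID.
docs = {
    1: {"catalog": {"author1": {"given": "Mike"}}},
    2: {"catalog": {"author1": {"given": "Erik"}}},
    3: {"catalog": {"author1": {"given": "Mike"}}},
}


def given_name(doc: dict) -> str:
    """Fetch one subfield from one document: cheap, no search involved."""
    return doc["catalog"]["author1"]["given"]


def scan(all_docs: dict, wanted: str) -> list:
    """Full scan: every document is visited for every query."""
    return [doc_id for doc_id, doc in all_docs.items() if given_name(doc) == wanted]


def build_inverted_index(all_docs: dict) -> dict:
    """Inverted index: value -> document IDs, built once at ingestion time."""
    index = defaultdict(list)
    for doc_id, doc in all_docs.items():
        index[given_name(doc)].append(doc_id)
    return index


index = build_inverted_index(docs)
print(scan(docs, "Mike"))   # visits all documents
print(index["Mike"])        # a single dictionary lookup
```

Both approaches return the same document IDs; the trade-off is query time versus the memory and disk space the index takes.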
Limitations
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
Possible implementation
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
- Flattened fields could be stored as Lucene blobs whose internal structure is defined by OpenSearch.
- The internal structure should be such that dot path references (such as .catalog.author1.given) are cheap. Perhaps binary JSON.
Security
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
Possible enhancements
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible:
- Querying the field as a whole searches all leaf values. For example, catalog=Mike would match the document above.
- Fine tune efficiency with various options controlling query interpretation, etc.
- Provide a concatenated index. In that index, the entry for the given field above would be something like "catalog|author1|given=Mike". This would provide efficient searching by field (assuming that indexes support prefix compression).
- Allow specifying that certain subfields should be indexed separately. (This could also be provided in logstash.)
- Support wildcards in field names.
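As a rough sketch of the concatenated index idea, the following Python function turns a nested object into "path=value" entries. The "|" separator and "=" delimiter follow the example entry in the list above; the function name concatenated_entries and everything else here are assumptions for illustration, not a proposed implementation.

```python
from typing import Any, Iterator


def concatenated_entries(value: Any, prefix: str) -> Iterator[str]:
    """Yield one 'path|to|leaf=value' string per leaf of a nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from concatenated_entries(child, f"{prefix}|{key}")
    else:
        yield f"{prefix}={value}"


# The catalog object from the example document.
catalog = {
    "title": "Lucene in Action",
    "author1": {"surname": "McCandless", "given": "Mike"},
}

# Sorting places entries with a shared path prefix next to each other,
# which is what would make prefix compression effective.
entries = sorted(concatenated_entries(catalog, "catalog"))
for entry in entries:
    print(entry)
```

An exact match on a subfield then becomes a lookup of one full entry, and all entries under a subtree share a common prefix.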
I'd like to underline the need for an official, feature-complete, and performant alternative to the flattened datatype. It has also been requested a lot on several sites already: https://github.com/opendistro-for-elasticsearch/opendistro-build/issues/523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ...
@dblock and @nknize is this in Lucene already? I couldn't tell from the Lucene docs. If it isn't should it be contributed there and then pulled into OpenSearch? Seems like it could add value in Lucene too.
@aparo this link is no longer there https://github.com/aparo/opensearch-flattened-mapper-plugin :(
Is there any plan to support this functionality in the near future?
I don't think anyone is working on it, cc: @anasalkouz?
I was involved in a recent discussion on the subject [1]. It would be really beneficial to have the flattened type, but I believe [2] was the subject of IP / copyright claims (@aparo it would be great to hear the reasons, many people are asking, thank you :pray:). To keep it short: we could probably add something similar to the flattened type but with a different name (and obviously a different implementation), but migrating existing Elasticsearch indices using snapshot / restore would be problematic (unless we internally supported type aliases etc.) since the type won't match.
[1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7 [2] https://github.com/aparo/opensearch-flattened-mapper-plugin
Hi, do you know when we can expect to have the new flattened type implemented? It is crucial for our business scenario. Thanks, Andrea.
@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR
+1!
@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.
Let me jump in :)
We have a case where a Kubernetes pod has built-in labels like app=foo, and we also have a bunch of services that use labels like app.kubernetes.io/managed-by: Helm. In the first case the field app is just a string. In the other case it is a nested object. When the log shipper sends such an entry to OpenSearch, it throws back a mapper_parsing_exception and drops the documents.
Flattened type solves such problems for fields that you don't want to use as nested
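The conflict can be reproduced outside OpenSearch with a short Python sketch. It assumes that dots in a key are treated as object path separators, as with dotted field names in a mapping; the helper name expand_dots is hypothetical and the label values are the ones from this comment.

```python
def expand_dots(labels: dict) -> dict:
    """Expand dotted keys into nested dicts, treating each dot as an object path step."""
    result: dict = {}
    for dotted_key, value in labels.items():
        node = result
        *parents, last = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[last] = value
    return result


pod_a = expand_dots({"app": "foo"})
pod_b = expand_dots({"app.kubernetes.io/managed-by": "Helm"})

print(pod_a)  # app is a plain string
print(pod_b)  # app is an object with nested keys

# The same field name carries two incompatible shapes, so a mapping that
# fixed "app" as a string after the first document cannot accept the second.
print(type(pod_a["app"]).__name__, type(pod_b["app"]).__name__)
```

With a flat field, both shapes would be stored under the one parent field without creating a separate mapping entry for app.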
@anasalkouz heya Anas -- what's the latest on this?
(question is also for @macrakis) :)
@dblock I think we should move the design proposal @macrakis pasted here into the issue summary as it is a pretty comprehensive proposal for the feature. Do you have any issue with that? I can make the change.
No issues with that! Sounds great.
@dblock @macrakis @elfisher Done.
Thanks @CEHENKLE!
I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)).
Good point, @dblock . Will do it that way going forward.
@aabukhalil Can you pick this up?
@CEHENKLE yes I will be working on this
Open questions:
- Do we have any legal concerns about using flattened or flat as the field type name? We need to close on what name to use.
- Since one of the motivations and demands for implementing this feature is ease of migration (Compatibility), what should we do if we agree not to use flattened as the name? Not using a matching name will make migration harder. Should we introduce field type aliasing?
Checklist of things to do:
- [x] How indexing, document writing, document mapping and field mapping is done.
- [ ] How to store the data on top of Lucene to support doc_value like access pattern for the subfields without causing mapping explosion (mapping overhead of flat datatype should be O(1)). In other words, how to overload single Lucene field to hold an object while allowing efficient dotted access. This is needed to support accessing subfields in the flat object for retrieval and aggregation.
- [ ] How do snapshots work, and how do we support the new field type? And how can we restore a snapshot created by Elasticsearch that has a "flattened" field type into OpenSearch?
- [ ] After confirming implementation details and how data will be stored, think about forward compatibility when adding more features to this new field type without causing migration, if possible at all.
@aabukhalil I agree, going with flattened would be ideal and would significantly ease the migration, but it does indeed pose legal risks. We actually had a very similar discussion regarding the dense_vector field type [1]; maybe a sign-off from one of the PMs (@CEHENKLE ?) would help here.
[1] https://github.com/opensearch-project/OpenSearch/issues/3545#issuecomment-1164749810
@reta yes I'm asking for help regarding legal implications.
When designing how to store the new field type, we should take into consideration forward compatibility and future extensibility for this field. Depending on how we store the data, some features can be added easily by us and activated easily by customers. Otherwise, if we don't account for future features, a full revamp of how the field is stored might be needed and we will lose compatibility between versions. What do you think? @dblock @nknize @macrakis @reta I need your opinion.
@chipzzz thanks for your feedback. I'm sorry, but I didn't get what you mean by lag. Which lag? And what do you mean by event here? Can you please elaborate so we can help? Samples would help too.
Just my 2 cents regarding the naming,
given that the implementation will not guarantee exactly the same functionality as the flattened field provides in Elasticsearch, we should intentionally not try to use similar naming. Going with different naming will make it clear that the functionality can differ (and will also be a clear signal that there are no legal concerns).
When migrating, I think it is better for users to cope with the fact that the mapping field naming is not exactly the same than to learn later that, despite the naming being the same, the functionality actually is not.
@lukas-vlcek Do you have an alternate naming proposal? Let's only discuss technical (vs. legal) merits of our options?
@mrkamel @abhishek-v @reta @andreaAlkalay @amalgamm @lukas-vicek The current Design Proposal is intentionally minimal. It covers the core functionality of the flat data type, namely ingesting nested objects as a single object and not indexing individual subfields. This has good performance characteristics and avoids mapping explosion. However, it does not implement the many options available on other systems. Notably it does not index the subfields. It also does not support snapshot restore from Elastic indexes. If I'm not mistaken, snapshots aren't even guaranteed compatible between different versions of Elastic.
What we'd like to know is whether that meets your needs. If not, which additional options are useful to you, and why?
The goal is to base feature development on your needs so as to keep the feature simple and performant.
@macrakis thanks for the notification. Not sure what not indexing individual subfields means performance wise for queries. Our use cases cover mostly i) querying leaf key/value pairs like with keyword fields, ii) aggregating leaf keys iii) dot retrieval and iv) textual sorting. Regarding querying we mostly use term/terms queries, but range/exists queries would be nice also. Querying without specifying a concrete leaf key is not important for us.
Regarding querying performance, it was stated that "Filtering by subfields is supported, but may be inefficient (full scan)" and "Finding a document with a specific value of a nested field (e.g., given = "Mike") is not efficient: it may require a full scan of the index, an expensive operation."
Does all that mean we can expect much worse query performance compared to the elasticsearch flattened type for a query like:
POST bug_reports/_search
{
"query": {
"term": {"labels.release": "v1.3.0"}
}
}
where labels is of type flattened and release is a leaf key? Comparable performance for those queries is very important to us.