
Add flattened field type

Open dblock opened this issue 4 years ago • 31 comments

(updated from https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805 below, @macrakis)

[Design Proposal] The flat data type in OpenSearch

Summary

JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.

Flat subfields support exact match queries and textual sorting.

Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.

Motivation

  • Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
  • Flat fields do not create a large number of fields, one per unique key. The "mapping explosion" caused by mapping many individual fields can lead to heavy RAM use and thrashing. (Memory efficiency)
  • Flat fields do not have inverted indexes which take space. (Space efficiency)
  • Migration from systems supporting similar flat types is easy. (Compatibility)

Demand

OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)

  • Prevent mapping explosion.

OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)

  • Migrating corpus with existing flattened mappings.

Specification

Mapping and ingestion

  • flattened is a new mapping type
  • Fields declared flattened are ingested as structured, nested objects.
  • Neither the field as a whole nor its subfields are indexed.
  • The nested fields are uninterpreted keywords, and sort like strings. Nested fields cannot be used as dates or numbers.
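
To illustrate the last point, here is a minimal sketch in plain Python (not OpenSearch code; the sample values are made up) of what "sort like strings" implies for subfield values that look like numbers:

```python
# Subfield values kept as uninterpreted keywords compare as strings,
# not as numbers. The values below are hypothetical sample data.
page_counts = ["9", "10", "200"]

as_keywords = sorted(page_counts)          # lexicographic (string) order
as_numbers = sorted(page_counts, key=int)  # numeric order, for contrast only

print(as_keywords)  # ['10', '200', '9']
print(as_numbers)   # ['9', '10', '200']
```

Under the proposal only the first ordering would be available for a flattened subfield, since numeric interpretation is explicitly out of scope.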

Searching and retrieving

  • Supports fetching subfields with the usual dotted notation.
  • Supports aggregations of subfields with the usual dotted notation.
  • Filtering by subfields is supported, but may be inefficient (full scan).

Example

This declares catalog as being of type flattened:

{ "mappings" :
  { "properties" :
     { "ISBN13"  : { "type" : "keyword" },
       "catalog" : { "type" : "flattened" } }
}}

Consider the ingestion of the following document:

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "author1" :
          { "surname" : "McCandless",
            "given"   : "Mike" }
}}

Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
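
As a rough illustration of dot-path access, the sketch below resolves catalog.author1.surname against the example document using plain Python dictionaries. The helper name get_by_path is hypothetical; this models only the addressing scheme, not how OpenSearch would store or read the field.

```python
from typing import Any


def get_by_path(doc: dict, path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if a step is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# The example document from the proposal.
doc = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "author1": {"surname": "McCandless", "given": "Mike"},
    },
}

print(get_by_path(doc, "catalog.author1.surname"))  # McCandless
print(get_by_path(doc, "catalog.author2.surname"))  # None
```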

Performance

  • Performance should be similar to a keyword field.
  • Fetching the value of a nested field using dot paths in a given document should be efficient.
  • Finding a document with a specific value of a nested field (e.g., given = 'Mike') is not efficient: it may require a full scan of the index, an expensive operation.
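
The cost difference in the last bullet can be sketched as follows, assuming documents are held as plain Python dictionaries (the function names and sample documents are hypothetical). Without a per-subfield index, filtering has to visit every document, so the work grows linearly with the size of the corpus.

```python
def matches(doc, path, value):
    """Return True if the dotted path exists in doc and equals value."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current == value


def full_scan(docs, path, value):
    """Check every document in turn; cost is proportional to len(docs)."""
    return [i for i, doc in enumerate(docs) if matches(doc, path, value)]


docs = [
    {"catalog": {"author1": {"given": "Mike"}}},
    {"catalog": {"author1": {"given": "Erik"}}},
    {"catalog": {"title": "Untitled"}},
]

print(full_scan(docs, "catalog.author1.given", "Mike"))  # [0]
```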

Limitations

Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.

Possible implementation

These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.

  • Flattened fields could be stored as Lucene blobs whose internal structure is defined by OpenSearch.
  • The internal structure should be such that dot path references (such as .catalog.author1.given) are cheap. Perhaps binary JSON.

Security

Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.

Possible enhancements

The current specification is minimal. It intentionally does not include many options offered by other vendors.

Depending on the user feedback we receive after the initial release, various enhancements are possible:

  • Querying the field as a whole searches all leaf values. For example, catalog=Mike would match the document above.
  • Fine-tune efficiency with various options controlling query interpretation, etc.
  • Provide a concatenated index. In that index, the entry for the given field above would be something like "catalog|author1|given=Mike". This would provide efficient searching by field (assuming that indexes support prefix compression).
  • Allow specifying that certain subfields should be indexed separately. (This could also be provided in logstash.)
  • Support wildcards in field names.
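
The concatenated index idea above can be sketched in Python as follows. This is only an illustration under stated assumptions: the "|" separator and "path=value" entry format follow the bullet, while the function names are hypothetical and an in-memory dictionary stands in for a real index.

```python
from collections import defaultdict


def leaf_entries(value, prefix):
    """Yield one 'path|to|leaf=value' string for every leaf under value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_entries(child, f"{prefix}|{key}")
    else:
        yield f"{prefix}={value}"


def build_index(docs, field):
    """Map each concatenated entry to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        if field in doc:
            for entry in leaf_entries(doc[field], field):
                index[entry].add(doc_id)
    return index


docs = [
    {"catalog": {"title": "Lucene in Action",
                 "author1": {"surname": "McCandless", "given": "Mike"}}},
    {"catalog": {"title": "Another Book", "author1": {"given": "Erik"}}},
]

index = build_index(docs, "catalog")
print(sorted(index["catalog|author1|given=Mike"]))  # [0]
```

With such entries, an exact match on a subfield becomes a single lookup instead of a scan, and all entries for one subfield share a common prefix.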

dblock avatar Jul 28 '21 13:07 dblock

I'd like to underline the need for an official, feature-complete, and performant alternative to the flattened datatype. It has already been requested many times on several sites: https://github.com/opendistro-for-elasticsearch/opendistro-build/issues/523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ...

mrkamel avatar Oct 07 '21 18:10 mrkamel

@dblock and @nknize is this in Lucene already? I couldn't tell from the Lucene docs. If it isn't should it be contributed there and then pulled into OpenSearch? Seems like it could add value in Lucene too.

elfisher avatar Nov 18 '21 20:11 elfisher

@aparo this link is no longer there: https://github.com/aparo/opensearch-flattened-mapper-plugin :(

chipzzz avatar Dec 15 '21 21:12 chipzzz

Is there any plan to support this functionality in the near future?

abhishek-v avatar Dec 17 '21 06:12 abhishek-v

Is there any plan to support this functionality in the near future?

I don't think anyone is working on it, cc: @anasalkouz?

dblock avatar Dec 18 '21 15:12 dblock

I was involved in a recent discussion on the subject [1]. It would be really beneficial to have the flattened type, but I believe [2] was the subject of IP / copyright claims (@aparo, it would be great to hear the reasons, many people are asking, thank you :pray:). To keep it short: we could probably add something similar to the flattened type under a different name (and, obviously, with a different implementation), but migrating existing Elasticsearch indices using snapshot / restore would be problematic (unless we internally supported type aliases, etc.) since the type won't match.

[1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7 [2] https://github.com/aparo/opensearch-flattened-mapper-plugin

reta avatar Jan 14 '22 20:01 reta

Hi, do you know when we can expect the new flattened type to be implemented? It is crucial for our business scenario. Thanks, Andrea.

andreaAlkalay avatar Feb 01 '22 08:02 andreaAlkalay

@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR

dblock avatar Feb 02 '22 18:02 dblock

+1!

tristandostaler avatar Feb 24 '22 19:02 tristandostaler

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

macrakis avatar Mar 14 '22 16:03 macrakis

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

Let me jump in :) We have a case where a Kubernetes pod has built-in labels like app=foo, and we also have a bunch of services that use labels like app.kubernetes.io/managed-by: Helm. In the first case, the field app is just a string. In the other case, it is a nested object. When the log shipper sends such an entry to OpenSearch, OpenSearch throws back a mapper_parsing_exception and drops the documents.

The flattened type solves such problems for fields that you don't want to use as nested.
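
The conflict described here can be sketched in Python, assuming (as with regular object mapping) that dots in a label key are expanded into nested objects. The helper names are hypothetical, and the sketch only shows why the two label sets disagree about the shape of app; it does not reproduce the actual exception.

```python
def expand_dots(flat):
    """Expand dotted keys into nested dicts, treating each dot as an object path."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def kind(value):
    """Classify a value the way a mapping would: object or plain string."""
    return "object" if isinstance(value, dict) else "string"


# Labels from two different pods, expanded separately.
pod_a = expand_dots({"app": "foo"})
pod_b = expand_dots({"app.kubernetes.io/managed-by": "Helm"})

print(pod_b)  # {'app': {'kubernetes': {'io/managed-by': 'Helm'}}}
print(kind(pod_a["app"]), kind(pod_b["app"]))  # string object
```

Because app is a string in one document and an object in the other, a single typed mapping cannot accept both, whereas a field that keeps the whole label object uninterpreted would not need to decide.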

amalgamm avatar Apr 04 '22 11:04 amalgamm

@anasalkouz heya Anas -- what's the latest on this?

CEHENKLE avatar Jun 14 '22 08:06 CEHENKLE

(question is also for @macrakis) :)

CEHENKLE avatar Jun 14 '22 09:06 CEHENKLE

[Design Proposal] The flat data type in OpenSearch


macrakis avatar Jul 18 '22 22:07 macrakis

@dblock I think we should move the design proposal @macrakis pasted here into the issue summary as it is a pretty comprehensive proposal for the feature. Do you have any issue with that? I can make the change.

elfisher avatar Jul 19 '22 12:07 elfisher

No issues with that! Sounds great.

dblock avatar Jul 21 '22 16:07 dblock

@dblock @macrakis @elfisher Done.

CEHENKLE avatar Jul 21 '22 19:07 CEHENKLE

Thanks @CEHENKLE!

elfisher avatar Jul 21 '22 19:07 elfisher

I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)).

dblock avatar Jul 21 '22 19:07 dblock

Good point, @dblock . Will do it that way going forward.

@aabukhalil Can you pick this up?

CEHENKLE avatar Jul 25 '22 17:07 CEHENKLE

@CEHENKLE yes I will be working on this

aabukhalil avatar Jul 25 '22 17:07 aabukhalil

Open questions:

  • Do we have any legal concerns about using flattened or flat as the field type name? We need to close on which name to use.
    • Since one of the motivations for this feature is ease of migration (compatibility), what should we do if we agree not to use flattened as the name? A non-matching name will make migration harder. Should we introduce field type aliasing?

Checklist of things to do:

  • [x] How indexing, document writing, document mapping, and field mapping are done.
  • [ ] How to store the data on top of Lucene to support a doc_value-like access pattern for the subfields without causing mapping explosion (the mapping overhead of the flat datatype should be O(1)). In other words, how to overload a single Lucene field to hold an object while allowing efficient dotted access. This is needed to support accessing subfields in the flat object for retrieval and aggregation.
  • [ ] How snapshots work and how to support the new field type. Also, how can we restore a snapshot created by Elasticsearch that has a "flattened" field type into OpenSearch?
  • [ ] After confirming implementation details and how data will be stored, think about forward compatibility, so that more features can be added to this new field type without requiring migration, if that is possible at all.
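
To make the O(1) mapping overhead requirement concrete, here is a rough Python sketch that counts how many distinct fields a corpus would produce if every leaf path were mapped separately, compared with mapping only the top-level field. The documents and key names are made up, and the count is a simplification of what a real mapping tracks.

```python
def leaf_paths(value, prefix=""):
    """Yield the dotted path of every leaf value under value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


# 1000 hypothetical documents, each with a different key under "labels".
docs = [{"labels": {f"key{i}": "v"}} for i in range(1000)]

# One mapped field per unique leaf path: grows with the number of unique keys.
per_key_fields = {path for doc in docs for path in leaf_paths(doc)}

# One mapped field for the whole object: constant, whatever the keys are.
flat_fields = {top for doc in docs for top in doc}

print(len(per_key_fields))  # 1000
print(len(flat_fields))     # 1
```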

aabukhalil avatar Aug 04 '22 01:08 aabukhalil

@aabukhalil I agree, going with flattened would be ideal and would significantly ease the migration, but it does indeed pose legal risks. We actually had a very similar discussion regarding the dense_vector field type [1]; maybe a sign-off from one of the PMs (@CEHENKLE ?) would help here.

[1] https://github.com/opensearch-project/OpenSearch/issues/3545#issuecomment-1164749810

reta avatar Aug 04 '22 19:08 reta

@reta yes I'm asking for help regarding legal implications.

aabukhalil avatar Aug 04 '22 19:08 aabukhalil

When designing how to store the new field type, we should take forward compatibility and future extensibility into consideration. Depending on how we store the data, some features can be added easily by us and activated easily by customers. Otherwise, if we don't account for future features, a full revamp of how the field is stored might be needed, and we would lose compatibility between versions. What do you think? @dblock @nknize @macrakis @reta I need your opinion.

aabukhalil avatar Aug 04 '22 19:08 aabukhalil

@chipzzz thanks for your feedback. I'm sorry, but I didn't get what you mean by lag. Which lag? And what do you mean by event here? Can you please elaborate so we can help? Even some samples would help.

aabukhalil avatar Aug 04 '22 20:08 aabukhalil

Just my 2 cents regarding the naming:

Given that the implementation will not guarantee exactly the same functionality as the flattened field provides in Elasticsearch, we should intentionally not try to use similar naming. Going with different naming will make it clear that the functionality can differ (and will also be a clear signal that there are no legal concerns).

When migrating, I think it is better for users to cope with the fact that the mapping field naming is not exactly the same than to learn later that, despite the same naming, the functionality is actually different.

lukas-vlcek avatar Aug 04 '22 20:08 lukas-vlcek

@lukas-vlcek Do you have an alternate naming proposal? Let's only discuss technical (vs. legal) merits of our options?

dblock avatar Aug 05 '22 15:08 dblock

@mrkamel @abhishek-v @reta @andreaAlkalay @amalgamm @lukas-vlcek The current Design Proposal is intentionally minimal. It covers the core functionality of the flat data type, namely ingesting nested objects as a single object and not indexing individual subfields. This has good performance characteristics and avoids mapping explosion. However, it does not implement the many options available on other systems. Notably, it does not index the subfields. It also does not support snapshot restore from Elastic indexes. If I'm not mistaken, snapshots aren't even guaranteed compatible between different versions of Elastic.

What we'd like to know is whether that meets your needs. If not, which additional options are useful to you, and why?

The goal is to base feature development on your needs so as to keep the feature simple and performant.

macrakis avatar Aug 06 '22 00:08 macrakis

@macrakis thanks for the notification. I'm not sure what not indexing individual subfields means performance-wise for queries. Our use cases mostly cover i) querying leaf key/value pairs as with keyword fields, ii) aggregating leaf keys, iii) dot retrieval, and iv) textual sorting. For querying we mostly use term/terms queries, but range/exists queries would also be nice. Querying without specifying a concrete leaf key is not important for us.

Regarding querying performance, it was stated that "Filtering by subfields is supported, but may be inefficient (full scan)" and "Finding a document with a specific value of a nested field (e.g., given = 'Mike') is not efficient: it may require a full scan of the index, an expensive operation."

Does all that mean we can expect much worse query performance compared to the Elasticsearch flattened type for a query like:

POST bug_reports/_search
{
  "query": {
    "term": {"labels.release": "v1.3.0"}
  }
}

where labels is of type flattened and release is a leaf key? Comparable performance for those queries is very important to us.

mrkamel avatar Aug 06 '22 19:08 mrkamel