OpenSearch [Feature Request] Support for open/neutral data formats for engine agnostic reads

Is your feature request related to a problem? Please describe

While writing data in Lucene format enables faster queries, it also restricts queries to a compatible Lucene query engine. As data grows over time, the need to keep the engine compatible with the older data format imposes another constraint, forcing users to choose between getting the benefits of newer versions and keeping older-format data readable. To upgrade the engine, data indexed in older formats must be re-indexed, which requires reading it from the source field with a compatible Lucene engine before individual documents can be re-indexed into a target version.

Describe the solution you'd like

The source field stores the raw document as a special field; however, this field can only be read by a compatible Lucene version. It would be good if we could store this field in an open/neutral format. This would enable users to:

Use a query engine of their choice to query original data, even if the Lucene data formats changed
Be able to re-index seamlessly without having to get locked in by the source data format. For instance check the complexity involved

There could be caveats with query performance when the actual document needs to be returned, depending on the data format; this needs to be evaluated further.
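As a rough illustration of what "open/neutral" could mean here, the sketch below writes raw documents as newline-delimited JSON and reads them back with nothing but a JSON parser. This is an assumption made for illustration only: the proposal does not name a concrete format, and the function names and the `_id`/`_source` record layout are hypothetical.

```python
import io
import json


def write_neutral_source(docs, sink):
    """Write each raw document as one JSON line (hypothetical record layout)."""
    for doc_id, source in docs:
        record = {"_id": doc_id, "_source": source}
        sink.write(json.dumps(record, sort_keys=True) + "\n")


def read_neutral_source(stream):
    """Yield (doc_id, source) pairs using only a JSON parser, no Lucene reader."""
    for line in stream:
        if line.strip():
            record = json.loads(line)
            yield record["_id"], record["_source"]


# Example data invented for this sketch.
docs = [
    ("1", {"title": "hello", "views": 3}),
    ("2", {"title": "world", "views": 7}),
]

buffer = io.StringIO()  # stands in for a file or object-store blob
write_neutral_source(docs, buffer)
buffer.seek(0)
recovered = list(read_neutral_source(buffer))
```

Any engine or re-indexing tool that can parse JSON could consume such a file, independent of the Lucene version that built the index. The trade-off noted above still applies: fetching a single document from a sequential file like this is slower than a stored-field lookup unless an offset index is kept alongside it.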

Related component

Storage

Describe alternatives you've considered

No response

Additional context

No response

Mar 27 '24 18:03 Bukhtawar

@Bukhtawar - Thanks for the proposal. Format could mean two things here: i) the format of the data represented as part of the document, ii) the format of the data at rest (compressed and stored). Currently the index codec defines both. Are you suggesting changing both or just the first one?

Mar 28 '24 03:03 backslasht

Thanks @backslasht. Here I intend to keep the data stored at rest in a format that makes it easier to plug in diverse query engines and helps data break free from the Lucene version compatibility constraints as much as possible.

Looping @reta @andrross @msfroh @tharejas @sachinpkale @gbbafna for thoughts

Mar 28 '24 08:03 Bukhtawar

Nice proposal!

I am trying to understand the scope of this feature request with the following questions:

For my understanding, is the source field part of the Lucene segment today? If yes, even if we change its type from a special field to a neutral type, say JSON, we still need Lucene to read the field first, right? Or are we proposing to store the source independently of segments?

Use a query engine of their choice to query original data, even if the Lucene data formats changed

Does querying the original data from another query engine bypass OpenSearch, or does this also mean OpenSearch would support pluggable query engines?

Mar 28 '24 10:03 sachinpkale

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

Mar 28 '24 13:03 reta

+1. I like the overall idea of decoupling the source from the engine. A couple of questions/thoughts:

Would that mean always storing the open data format, as opposed to making it optional? The reason I am asking is that if we could, theoretically, reindex without the source itself, that could be a cheaper alternative.
Do we need to explore multiple/pluggable Lucene engines to get around this problem of incompatibility?

Mar 28 '24 13:03 gbbafna

it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

@reta @Bukhtawar There indeed seems to be some overlap here with an ingestion tool like Data Prepper, where you can configure another sink alongside OpenSearch and store the data in a neutral, analytics-friendly format. The two use cases listed in this issue ("use any query engine" and "reindex seamlessly") could be solved by ingesting the original data into an additional sink. However, in that case OpenSearch has no knowledge of the other data and cannot use it the way that it uses the source field today. It's an interesting thought to consider whether we can replace the existing source field that OpenSearch knows about and uses with a neutral, more future-proof format and get the best of both worlds.

Mar 28 '24 16:03 andrross

Thanks @Bukhtawar for the proposal.

I definitely see the value of storing the _source field in a data format that is not bound to the Lucene engine version (considering it is just a document blob), especially for re-indexing.

Mar 28 '24 16:03 shwetathareja

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

That's true, I don't think you can rely on the _source field, since it can be disabled. https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field

Mar 28 '24 17:03 anasalkouz

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

That's true, I don't think you can rely on the _source field, since it can be disabled. https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field

Obviously we are talking about a new data format that would apply from newer versions onwards. Based on how the proposal goes, we can always decide to change that if we see good benefits, especially as OpenSearch has good support for durability but is constrained on data compatibility.

Mar 28 '24 17:03 Bukhtawar

A few thoughts:

Being able to guarantee access to the original documents ingested by a cluster would be awesome for enabling multi-version upgrades going forward, but all storage has a cost. I can see arguments that the feature would need to be optional for that reason.
If we consider a store other than Lucene segments, we'll also need to tackle some of the things those segments currently give us, like handling updates/deletes. Not necessarily an argument against the approach, but I just wanted to point out some additional work needed to support it.
Having the original docs outside of Lucene segments would make parsing/reindexing easier and allow us more flexibility to split up reindexing across many workers, instead of having a strong inclination toward a worker per shard because the shards must be treated as separate Lucene indices in order to extract the docs.
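To make the last two points concrete, here is a minimal sketch of how a side store might handle updates/deletes and shard-independent reindexing. It assumes an append-only log of operations in arrival order, with deletes recorded as tombstones; the record layout, the `op` values, and both function names are hypothetical and not part of any existing OpenSearch API.

```python
import zlib
from collections import defaultdict


def latest_versions(log):
    """Replay an append-only log; keep the newest source per ID, drop deleted IDs."""
    live = {}
    for record in log:
        if record["op"] == "delete":
            live.pop(record["_id"], None)  # tombstone removes the document
        else:
            live[record["_id"]] = record["_source"]  # "index" is create-or-update
    return live


def partition_for_workers(live_docs, num_workers):
    """Assign documents to workers by a stable hash of the ID, independent of shards."""
    batches = defaultdict(list)
    for doc_id, source in live_docs.items():
        worker = zlib.crc32(doc_id.encode("utf-8")) % num_workers
        batches[worker].append((doc_id, source))
    return dict(batches)


# Example log invented for this sketch: "a" is updated, "b" is deleted.
log = [
    {"op": "index", "_id": "a", "_source": {"v": 1}},
    {"op": "index", "_id": "b", "_source": {"v": 1}},
    {"op": "index", "_id": "a", "_source": {"v": 2}},
    {"op": "delete", "_id": "b"},
]

live = latest_versions(log)
batches = partition_for_workers(live, num_workers=4)
```

The replay step is the work that Lucene segments currently do for us through live-docs tracking and merges; a real side store would also need periodic compaction so the log does not grow without bound. The partitioning step shows why worker count no longer has to follow shard count once documents can be read without opening each shard as a Lucene index.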

Question: @Bukhtawar Do we think that storing the original docs outside of Lucene would enable us to compress them better, reducing the burden of storage?

Apr 03 '24 18:04 chelma

Hi @Bukhtawar that's a very interesting suggestion! Some clarification questions to make sure I get it right:

Today _source is a codec that extends StoredFieldsFormat in Lucene. Are you suggesting moving entirely from the Lucene StoredFieldsFormat interface to a new interface?
Or are you suggesting keeping the Lucene interface and only extending StoredFieldsFormat with a non-default Lucene codec that can be more easily read by other systems?

Context: I currently have a working POC in which I extended the _source field to work with the Parquet format. I did so by extending StoredFieldsFormat in the Lucene interfaces. I would love to share the pros and cons I have seen.

Apr 29 '24 16:04 sam-herman