Synthetic Source

Open nik9000 opened this issue 2 years ago • 4 comments

This shrinks the index by implementing a "synthetic" _source field. Instead of saving the field to disk, we reconstruct it on the fly from doc values, our column store.
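To make the idea concrete, here is a minimal sketch in Python of rebuilding a document from per-field columns. It is only an illustration, not how Elasticsearch implements it: the `DocValues` layout, the `synthesize_source` function, and the sample data are all hypothetical, and real doc values live in Lucene segments rather than in dictionaries.

```python
from typing import Any

# Hypothetical stand-in for a column store: field name -> {doc id -> values}.
DocValues = dict[str, dict[int, list[Any]]]


def synthesize_source(doc_values: DocValues, doc_id: int) -> dict[str, Any]:
    """Rebuild a _source-like dict for one document from per-field columns."""
    source: dict[str, Any] = {}
    for field in sorted(doc_values):
        values = doc_values[field].get(doc_id)
        if not values:
            # This document has no value for the field, so it is omitted.
            continue
        # A single value is emitted as a scalar, several values as a list.
        value = values[0] if len(values) == 1 else values
        # Dotted field names are turned back into nested objects.
        node = source
        *parents, leaf = field.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return source


columns: DocValues = {
    "host.name": {0: ["web-1"], 1: ["web-2"]},
    "status": {0: [200, 404]},
}

print(synthesize_source(columns, 0))
print(synthesize_source(columns, 1))
```

The sketch walks fields in sorted order and reads values exactly as the columns hold them, so the rebuilt document need not match the original JSON byte for byte.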

Before removing the feature flag

  • [x] Initial implementation #85649
  • [x] Figure out how much performance we'd get from using synthetic source for recovery - thus removing the _recovery_source field. https://github.com/elastic/elasticsearch/pull/85649#issuecomment-1122258873
  • [x] Figure out final API to turn this on (@qhoxie)
  • [x] If it stays synthetic: true then we'll have to fail mappings that contain enabled: false, synthetic: true #87270
  • [x] Flip API from synthetic: true to synthetic: strict so we have room to add more later. We totally will. @romseygeek
  • [x] Rally track https://github.com/elastic/rally-tracks/pull/268 + https://github.com/elastic/rally-tracks/pull/270
  • [x] Resolve remaining round trip tests #86760
  • [x] Realtime GET should synthesize on load to return consistent results. (#87578)
  • [x] Add an option to _search to simulate synthetic source (#87068)
  • [x] Support the simulate option on GET and MGET (#87536 + #87574)
  • [x] Make sure we throw an error if you try to enable or disable synthetic source on an index #87182
  • [x] Support "subobjects" https://github.com/elastic/elasticsearch/pull/86166 (#87261)
  • [x] Docs #87416
  • [x] Figure out highlighting #87667

Later

  • [x] Make sure there is a nice error message when scripts try to access synthetic source - it won't be there. They should use doc values or the fancy new fields API. @tmgordeeva #88334
  • [ ] Add support for more field types
    • [x] aggregate_metric_double field type #88909
    • [x] constant_keyword #88603
    • [x] dense_vector #89840
    • [x] histogram #89833
    • [x] keyword fields with ignore_above (#87480 + #89466)
    • [x] match_only_text #89516
    • [x] version #89706
  • [ ] Support fields in runtime fields scripts
    • [x] Numbers #89888
    • [x] ip #89888
    • [ ] text (#89950 + more)
    • [ ] keyword (#89950)
    • [ ] match_only_text (#89950)
  • [x] Support loading from stored fields (text would love it!) #87480
  • [ ] Rally tests for random fetch
  • [x] Look into the enrich processor (#89554)
  • [ ] Improve performance of synthesis
    • [x] General #87882
    • [x] Load column-wise #87930 #88025
    • [ ] Parallel loading?
  • [ ] Make fields API aware of synthetic-ness and go to doc values rather than rebuilding _source if _source isn't separately needed.
  • [ ] Document best practices for load over synthetic source
  • [ ] Support for ignore_malformed #90007
    • [ ] ip #90038
    • [ ] numeric
    • [ ] geo_point

Much later

  • [ ] Synthesize instead of using _recovery_source - we find that it'd improve write performance by ~11%. We'd have to synthesize on load instead. That's pretty slow. We'd love the 11% but we have to be careful here.

nik9000 avatar May 10 '22 12:05 nik9000

Pinging @elastic/es-search (Team:Search)

elasticmachine avatar May 10 '22 12:05 elasticmachine

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticmachine avatar May 10 '22 12:05 elasticmachine

@nik9000 does synthetic source leverage _source_include/_source_exclude for the fields it has to synthesize?

jsoriano avatar Jul 26 '22 10:07 jsoriano

@nik9000 does synthetic source leverage _source_include/_source_exclude for the fields it has to synthesize?

It does not. There is no support at the moment for any kind of partial synthesis.

nik9000 avatar Jul 26 '22 12:07 nik9000

Awesome feature, can't wait to have this in GA!!

rocco8620 avatar Oct 02 '22 21:10 rocco8620

Hello @nik9000 , can I pick some of the unchecked subtasks?

Kiriakos1998 avatar May 31 '23 16:05 Kiriakos1998

Hello @nik9000 , can I pick some of the unchecked subtasks?

I think all of the unchecked tasks are quite difficult, to be honest. The ignore_malformed ones are maybe easier, but I wouldn't suggest picking them up.

Also you'd need a committer buddy and I've had to move on to other tasks sadly. That might be quite difficult to find too.

nik9000 avatar May 31 '23 16:05 nik9000

@nik9000 does synthetic source leverage _source_include/_source_exclude for the fields it has to synthesize?

It does not. There is no support at the moment for any kind of partial synthesis.

Hi @nik9000 - just for my own clarity. You can either use mode: synthetic on its own or use _source_include/_source_exclude? But the two cannot be combined? Is this correct?

iby-dev avatar Mar 22 '24 15:03 iby-dev

Hi @nik9000 - just for my own clarity. You can either use mode: synthetic on its own or use _source_include/_source_exclude? But the two cannot be combined? Is this correct?

Right. I honestly didn't know how to combine them so I just declared combining them to be incompatible.

Keep in mind synthetic source is only GA for time series indices and data streams. I've had to move on to other things but expect folks will get back to working on getting synthetic source good in more contexts at some point soon.

nik9000 avatar Mar 22 '24 15:03 nik9000

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine avatar May 31 '24 12:05 elasticsearchmachine