elasticsearch
Synthetic Source
This shrinks the index by implementing a "synthetic" _source field. Instead of saving the field to disk we reconstruct it on the fly using our column store, doc values.
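As a rough illustration of the idea, here is a minimal Python sketch of rebuilding a document from per-field columns. The `DOC_VALUES` layout and the `synthesize_source` helper are hypothetical stand-ins invented for this example; they are not Elasticsearch or Lucene APIs, and real doc values are stored very differently.

```python
from typing import Any

# Toy column store: field name -> {doc id -> list of values}.
# This only mimics the idea of doc values; it is not Lucene's format.
DOC_VALUES: dict[str, dict[int, list[Any]]] = {
    "host.name": {0: ["web-1"], 1: ["web-2"]},
    "host.ip": {0: ["10.0.0.1"]},
    "bytes": {0: [512], 1: [2048, 128]},
}


def synthesize_source(doc_id: int) -> dict[str, Any]:
    """Rebuild a nested document for doc_id from the per-field columns."""
    source: dict[str, Any] = {}
    for field, column in DOC_VALUES.items():
        values = column.get(doc_id)
        if not values:
            continue  # the document has no value for this field
        # Walk (or create) the parent objects for a dotted field name.
        node = source
        *parents, leaf = field.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        # Single values are emitted bare, multiple values as an array.
        node[leaf] = values[0] if len(values) == 1 else list(values)
    return source
```

Calling `synthesize_source(0)` on this toy data yields a nested object with `host.name`, `host.ip`, and `bytes`, even though no original document was ever stored.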
Before removing the feature flag
- [x] Initial implementation #85649
- [x] Figure out how much performance we'd get from using synthetic source for recovery - thus removing the `_recovery_source` field. https://github.com/elastic/elasticsearch/pull/85649#issuecomment-1122258873
- [x] Figure out final API to turn this on (@qhoxie)
  - [x] If it stays `synthetic: true` then we'll have to fail mappings that contain `enabled: false, synthetic: true` #87270
  - [x] Flip API from `synthetic: true` to `synthetic: strict` so we have room to add more later. We totally will. @romseygeek
- [x] Rally track https://github.com/elastic/rally-tracks/pull/268 + https://github.com/elastic/rally-tracks/pull/270
- [x] Resolve remaining round trip tests #86760
- [x] Realtime GET should synthesize on load to return consistent results. (#87578)
- [x] Add an option to _search to simulate synthetic source (#87068)
- [x] Support the simulate option on GET and MGET (#87536 + #87574)
- [x] Make sure we throw an error if you try to enable or disable synthetic source on an index #87182
- [x] Support "subobjects" https://github.com/elastic/elasticsearch/pull/86166 (#87261)
- [x] Docs #87416
- [x] Figure out highlighting #87667
Later
- [x] Make sure there is a nice error message when scripts try to access synthetic source - it won't be there. They should use doc values or the fancy new fields API. @tmgordeeva #88334
- [ ] Add support for more field types
  - [x] `aggregate_metric_double` field type #88909
  - [x] `constant_keyword` #88603
  - [x] `dense_vector` #89840
  - [x] `histogram` #89833
  - [x] `keyword` fields with `ignore_above` (#87480 + #89466)
  - [x] `match_only_text` #89516
  - [x] `version` #89706
- [ ] Support `fields` in runtime fields scripts
  - [x] Numbers #89888
  - [x] `ip` #89888
  - [ ] `text` (#89950 + more)
  - [ ] `keyword` (#89950)
  - [ ] `match_only_text` (#89950)
- [x] Support loading from stored fields (text would love it!) #87480
- [ ] Rally tests for random fetch
- [x] Look into the `enrich` processor (#89554)
- [ ] Improve performance of synthesis
  - [x] General #87882
  - [x] Load column-wise #87930 #88025
  - [ ] Parallel loading?
  - [ ] Make `fields` API aware of synthetic-ness and go to doc values rather than rebuilding `_source` if `_source` isn't separately needed.
- [ ] Document best practices for load over synthetic source
- [ ] Support for `ignore_malformed` #90007
  - [ ] `ip` #90038
  - [ ] `numeric`
  - [ ] `geo_point`
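To make the "Load column-wise" item in the list above a little more concrete, here is a small Python sketch contrasting the two iteration orders. It is only an illustration under assumed names (`Column`, `load_row_wise`, `load_column_wise` are hypothetical); it does not reflect how #87930 or #88025 are actually implemented. The assumption is that visiting one column for a whole batch of documents before moving to the next column gives more sequential access than hopping across all columns for each document.

```python
from typing import Any

Column = dict[int, Any]  # doc id -> value, a stand-in for one doc-values column


def load_row_wise(columns: dict[str, Column], doc_ids: list[int]) -> list[dict[str, Any]]:
    """For each document, visit every column (hops between columns per doc)."""
    docs: list[dict[str, Any]] = []
    for doc_id in doc_ids:
        doc: dict[str, Any] = {}
        for field, column in columns.items():
            if doc_id in column:
                doc[field] = column[doc_id]
        docs.append(doc)
    return docs


def load_column_wise(columns: dict[str, Column], doc_ids: list[int]) -> list[dict[str, Any]]:
    """For each column, visit every document in order, then move to the next column."""
    docs: list[dict[str, Any]] = [{} for _ in doc_ids]
    for field, column in columns.items():
        for slot, doc_id in enumerate(doc_ids):
            if doc_id in column:
                docs[slot][field] = column[doc_id]
    return docs
```

Both functions return the same documents for the same input; only the access pattern differs.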
Much later
- [ ] Synthesize instead of using `_recovery_source` - we find that it'd improve write performance by ~11%. We'd have to synthesize on load instead. That's pretty slow. We'd love the 11% but we have to be careful here.
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-analytics-geo (Team:Analytics)
@nik9000 does synthetic source leverage `_source_include`/`_source_exclude` for the fields it has to synthesize?
> @nik9000 does synthetic source leverage `_source_include`/`_source_exclude` for the fields it has to synthesize?
It does not. There is no support at the moment for any kind of partial synthesis.
Awesome feature, can't wait to have this in GA!!
Hello @nik9000, can I pick some of the unchecked subtasks?
> Hello @nik9000, can I pick some of the unchecked subtasks?
I think all of the unchecked tasks are quite difficult, to be honest. The `ignore_malformed` ones are maybe easier, but I wouldn't suggest picking them up.
Also you'd need a committer buddy and I've had to move on to other tasks sadly. That might be quite difficult to find too.
> > @nik9000 does synthetic source leverage `_source_include`/`_source_exclude` for the fields it has to synthesize?
>
> It does not. There is no support at the moment for any kind of partial synthesis.
Hi @nik9000 - just for my own clarity. You can either use `mode: synthetic` on its own or use `_source_include`/`_source_exclude`, but the two cannot be combined? Is this correct?
> Hi @nik9000 - just for my own clarity. You can either use `mode: synthetic` on its own or use `_source_include`/`_source_exclude`, but the two cannot be combined? Is this correct?
Right. I honestly didn't know how to combine them so I just declared combining them to be incompatible.
Keep in mind synthetic source is only GA for time series indices and data streams. I've had to move on to other things but expect folks will get back to working on getting synthetic source good in more contexts at some point soon.
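For readers following along, the rule discussed above can be sketched in Python as a mappings body plus a client-side check. This is an illustration only: `synthetic_source_mapping` and `check_source_options` are hypothetical helpers, and treating the thread's `_source_include`/`_source_exclude` as the `includes`/`excludes` keys of the `_source` mapping is an assumption made for this example. The server performs its own validation regardless.

```python
from typing import Any


def synthetic_source_mapping(properties: dict[str, Any]) -> dict[str, Any]:
    """Build a mappings body that asks for synthetic _source."""
    return {"_source": {"mode": "synthetic"}, "properties": properties}


def check_source_options(mappings: dict[str, Any]) -> None:
    """Hypothetical pre-flight check: reject synthetic mode combined with filtering."""
    source = mappings.get("_source", {})
    if source.get("mode") != "synthetic":
        return  # nothing to check for stored _source
    # Assumed key names for source filtering in the mapping.
    filters = [key for key in ("includes", "excludes") if source.get(key)]
    if filters:
        raise ValueError(f"synthetic _source cannot be combined with {filters}")
```

A mapping built by `synthetic_source_mapping` passes the check as-is; adding an `excludes` list to its `_source` section makes `check_source_options` raise `ValueError`.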
Pinging @elastic/es-storage-engine (Team:StorageEngine)