[Feature Request] Capability to remove _recovery_source per field
Is your feature request related to a problem? Please describe
During indexing, all fields ingested into OpenSearch are stored in `_source`. If required, a user can exclude individual fields from `_source` or disable it completely. But if the user does this, a `_recovery_source` stored field gets added (ref), which is only removed later on.
So overall the whole payload is still written as a StoredField, and that impacts indexing time. The impact is high when one of the fields is a vector field: in my experiments with a 768D, 1M-vector dataset I see a 50% reduction in indexing latency at the p90 level.
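For context, this is roughly how a user excludes a vector field from `_source` today (the index and field names below are placeholders, and the k-NN method parameters are omitted); even with this mapping, the full document, vector included, still ends up in the `_recovery_source` stored field:

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```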
Benchmarking Results
Below are the benchmarking results. Refer to Appendix A for the flame graphs.
Cluster Configuration
| Setting | Value |
|---|---|
| Data Node Type | r5.2xlarge |
| Data Node Count | 3 |
| Dataset | cohere-768D |
| Number of Vectors | 1M |
| Tool | OpenSearch Benchmark (OSB) |
| Indexing Clients | 1 |
| Number of Shards | 3 |
| Index Thread Qty | 1 |
| Engine | nmslib |
| EF Construction | 256 |
| EF Search | 256 |
Baseline Results
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 6.3266 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 1.99598 | min |
| Max cumulative indexing time across primary shards | | 2.27533 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 45.0456 | min |
| Cumulative merge count of primary shards | | 7 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 14.4361 | min |
| Max cumulative merge time across primary shards | | 15.8919 | min |
| Cumulative merge throttle time of primary shards | | 0.168133 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.0523167 | min |
| Max cumulative merge throttle time across primary shards | | 0.0632667 | min |
| Cumulative refresh time of primary shards | | 6.0824 | min |
| Cumulative refresh count of primary shards | | 144 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 1.90423 | min |
| Max cumulative refresh time across primary shards | | 2.14323 | min |
| Cumulative flush time of primary shards | | 23.2522 | min |
| Cumulative flush count of primary shards | | 32 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 7.4859 | min |
| Max cumulative flush time across primary shards | | 8.08212 | min |
| Total Young Gen GC time | | 0.927 | s |
| Total Young Gen GC count | | 39 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 12.4152 | GB |
| Translog size | | 5.6345e-07 | GB |
| Min Throughput | custom-vector-bulk | 885.98 | docs/s |
| Mean Throughput | custom-vector-bulk | 1017.74 | docs/s |
| Median Throughput | custom-vector-bulk | 1007.87 | docs/s |
| Max Throughput | custom-vector-bulk | 1082.07 | docs/s |
| 50th percentile latency | custom-vector-bulk | 44.5227 | ms |
| 90th percentile latency | custom-vector-bulk | 59.4799 | ms |
| 99th percentile latency | custom-vector-bulk | 185.057 | ms |
| 99.9th percentile latency | custom-vector-bulk | 335.08 | ms |
| 99.99th percentile latency | custom-vector-bulk | 490.821 | ms |
| 100th percentile latency | custom-vector-bulk | 545.323 | ms |
Results After Removing _recovery_source and _source
POC code: https://github.com/navneet1v/OpenSearch/commit/6c5896a69dd44b3076e9211e2e98ff87cc29ee65
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 5.49067 | min |
| Min cumulative indexing time across primary shards | | 1.71528 | min |
| Median cumulative indexing time across primary shards | | 1.77818 | min |
| Max cumulative indexing time across primary shards | | 1.9972 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 0.3878 | min |
| Cumulative merge count of primary shards | | 3 | |
| Min cumulative merge time across primary shards | | 0.118133 | min |
| Median cumulative merge time across primary shards | | 0.133017 | min |
| Max cumulative merge time across primary shards | | 0.13665 | min |
| Cumulative merge throttle time of primary shards | | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | min |
| Cumulative refresh time of primary shards | | 13.38 | min |
| Cumulative refresh count of primary shards | | 86 | |
| Min cumulative refresh time across primary shards | | 4.0509 | min |
| Median cumulative refresh time across primary shards | | 4.11963 | min |
| Max cumulative refresh time across primary shards | | 5.20948 | min |
| Cumulative flush time of primary shards | | 31.9997 | min |
| Cumulative flush count of primary shards | | 27 | |
| Min cumulative flush time across primary shards | | 10.0862 | min |
| Median cumulative flush time across primary shards | | 10.2099 | min |
| Max cumulative flush time across primary shards | | 11.7035 | min |
| Store size | | 11.7547 | GB |
| Translog size | | 1.19775 | GB |
| Min Throughput | custom-vector-bulk | 932.63 | docs/s |
| Mean Throughput | custom-vector-bulk | 1133.94 | docs/s |
| Median Throughput | custom-vector-bulk | 1131.69 | docs/s |
| Max Throughput | custom-vector-bulk | 1167.02 | docs/s |
| 50th percentile latency | custom-vector-bulk | 39.0611 | ms |
| 90th percentile latency | custom-vector-bulk | 47.5916 | ms |
| 99th percentile latency | custom-vector-bulk | 83.2863 | ms |
| 99.9th percentile latency | custom-vector-bulk | 177.238 | ms |
| 99.99th percentile latency | custom-vector-bulk | 288.076 | ms |
| 100th percentile latency | custom-vector-bulk | 319.616 | ms |
| error rate | custom-vector-bulk | 0 | % |
Describe the solution you'd like
Just like `_source`, where we can specify which fields are included/excluded or disable it completely, I was thinking of adding the same capability for `_recovery_source`. This would ensure that users can remove specific fields from the recovery source if required.
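One possible shape for this, mirroring the existing `_source` mapping options, is sketched below. The `_recovery_source` mapping key and its placement are only illustrative, not a final API:

```json
PUT /my-vector-index
{
  "mappings": {
    "_source": {
      "excludes": ["my_vector"]
    },
    "_recovery_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```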
Related component
Indexing:Performance
Describe alternatives you've considered
There is currently no alternative way to disable `_recovery_source` for an index.
FAQ
Q1: If a user needs the vector field, how can they retrieve it? Also, the recovery source and `_source` are used for other purposes such as update-by-query, disaster recovery, etc.; how are we going to support those?
In the k-NN repo we are working on a PR which will ensure that the k-NN vector field is read from doc values. Ref: https://github.com/opensearch-project/k-NN/pull/1571. For other fields, similar capabilities can be added in core in an incremental fashion if needed.
Appendix A
Flame graph when `_source`/`_recovery_source` is stored during indexing for vector fields.
Flame graph when `_source` and `_recovery_source` are not stored.
I can work on building this feature.
@msfroh, @reta, @dblock
Updated the flame graphs in the appendix section of the issue.
Thanks @navneet1v for sharing these insights.
- What compression ratio do we see for our benchmark corpus documents in stored fields compared to the raw document data?
- I'm wondering if the inherent problem here is the compression algorithm (looking at the flame graphs). Have you already experimented with disabling compression or trying zstd on stored fields to see what kind of behaviour they show? (One way to test this is sketched below.)
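For reference, a quick way to test the stored-field compression angle could be to recreate the index with a different `index.codec`. This is a static setting, so it has to be supplied at index creation time, and the `zstd`/`zstd_no_dict` values assume an OpenSearch version where those codecs are available (`best_compression` is the built-in alternative); the index name is a placeholder:

```json
PUT /my-vector-index
{
  "settings": {
    "index": {
      "codec": "zstd"
    }
  }
}
```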
`_recovery_source` was introduced to support ops-based recovery through Lucene after soft deletes were enabled, so that we can rely on the Lucene changes snapshot. How will you continue to support this with `_recovery_source` disabled?
> Just like `_source`, where we can specify which fields are included/excluded or disable it completely, I was thinking of adding the same capability for `_recovery_source`. This would ensure that users can remove specific fields from the recovery source if required.
Today, when the original source differs from the source being written (due to including/excluding specific fields), the recovery source is always written using the original source to ensure ops-based recovery. Won't this change diverge from the intent it was serving?
@mgodwan I have answered this question in the FAQ section of the above issue. For the vector field we are working on a PR which will allow the vector field values to be read from doc values and put into the `_source` response. This will ensure that recovery is possible after a crash.
Plus, given that `_recovery_source` gets deleted after a certain point in time, the recovery source was never a foolproof way to recover from crashes.
> Plus, given that `_recovery_source` gets deleted after a certain point in time, the recovery source was never a foolproof way to recover from crashes.
If I understand the feature correctly, I don't think that was ever its use case.
It is used for primary-replica syncing (similar to how the translog had been used in the past). Please evaluate how this will continue to function after a change that disables `_recovery_source` completely, since as of 2.x it is the default way of syncing replicas with the primary.
> Please evaluate how this will continue to function after a change that disables `_recovery_source` completely, since as of 2.x it is the default way of syncing replicas with the primary.
Is there a way I can check this? Any integration test or anything else that can help me here?
@mgodwan