[Feature Request] Capability to remove _recovery_source per field
Is your feature request related to a problem? Please describe
During indexing, all fields ingested into OpenSearch are stored in `_source`. If required, a user can exclude individual fields from `_source` or disable it completely. But if the user does this, a `_recovery_source` stored field gets added (ref), which is only removed later on.
So overall the whole payload is still written as a StoredField, and that impacts indexing time. The impact is high when one of the fields is a vector field: in my experiments with a 768D, 1M-vector dataset I see a 50% reduction in indexing latency at the p90 level.
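For context, this is roughly how a user excludes a vector field from `_source` today (the index and field names below are placeholders, and the k-NN method parameters are omitted); even with this mapping, the full document, vector included, still ends up in the `_recovery_source` stored field:

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```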
Benchmarking Results
Below are the benchmarking results. Refer to Appendix A for the flame graphs.
Cluster Configuration
| Setting | Value |
|---|---|
| Data Node Type | r5.2xlarge |
| Data Node Count | 3 |
| Dataset | cohere-768D |
| Number of Vectors | 1M |
| Tool | OpenSearch Benchmark (OSB) |
| Indexing Clients | 1 |
| Number of Shards | 3 |
| Index Thread Qty | 1 |
| Engine | nmslib |
| EF Construction | 256 |
| EF Search | 256 |
Baseline Results
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 6.3266 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 1.99598 | min |
| Max cumulative indexing time across primary shards | | 2.27533 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 45.0456 | min |
| Cumulative merge count of primary shards | | 7 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 14.4361 | min |
| Max cumulative merge time across primary shards | | 15.8919 | min |
| Cumulative merge throttle time of primary shards | | 0.168133 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.0523167 | min |
| Max cumulative merge throttle time across primary shards | | 0.0632667 | min |
| Cumulative refresh time of primary shards | | 6.0824 | min |
| Cumulative refresh count of primary shards | | 144 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 1.90423 | min |
| Max cumulative refresh time across primary shards | | 2.14323 | min |
| Cumulative flush time of primary shards | | 23.2522 | min |
| Cumulative flush count of primary shards | | 32 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 7.4859 | min |
| Max cumulative flush time across primary shards | | 8.08212 | min |
| Total Young Gen GC time | | 0.927 | s |
| Total Young Gen GC count | | 39 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 12.4152 | GB |
| Translog size | | 5.6345e-07 | GB |
| Min Throughput | custom-vector-bulk | 885.98 | docs/s |
| Mean Throughput | custom-vector-bulk | 1017.74 | docs/s |
| Median Throughput | custom-vector-bulk | 1007.87 | docs/s |
| Max Throughput | custom-vector-bulk | 1082.07 | docs/s |
| 50th percentile latency | custom-vector-bulk | 44.5227 | ms |
| 90th percentile latency | custom-vector-bulk | 59.4799 | ms |
| 99th percentile latency | custom-vector-bulk | 185.057 | ms |
| 99.9th percentile latency | custom-vector-bulk | 335.08 | ms |
| 99.99th percentile latency | custom-vector-bulk | 490.821 | ms |
| 100th percentile latency | custom-vector-bulk | 545.323 | ms |
Results After Removing _recovery_source and _source
POC code: https://github.com/navneet1v/OpenSearch/commit/6c5896a69dd44b3076e9211e2e98ff87cc29ee65
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 5.49067 | min |
| Min cumulative indexing time across primary shards | | 1.71528 | min |
| Median cumulative indexing time across primary shards | | 1.77818 | min |
| Max cumulative indexing time across primary shards | | 1.9972 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 0.3878 | min |
| Cumulative merge count of primary shards | | 3 | |
| Min cumulative merge time across primary shards | | 0.118133 | min |
| Median cumulative merge time across primary shards | | 0.133017 | min |
| Max cumulative merge time across primary shards | | 0.13665 | min |
| Cumulative merge throttle time of primary shards | | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | min |
| Cumulative refresh time of primary shards | | 13.38 | min |
| Cumulative refresh count of primary shards | | 86 | |
| Min cumulative refresh time across primary shards | | 4.0509 | min |
| Median cumulative refresh time across primary shards | | 4.11963 | min |
| Max cumulative refresh time across primary shards | | 5.20948 | min |
| Cumulative flush time of primary shards | | 31.9997 | min |
| Cumulative flush count of primary shards | | 27 | |
| Min cumulative flush time across primary shards | | 10.0862 | min |
| Median cumulative flush time across primary shards | | 10.2099 | min |
| Max cumulative flush time across primary shards | | 11.7035 | min |
| Store size | | 11.7547 | GB |
| Translog size | | 1.19775 | GB |
| Min Throughput | custom-vector-bulk | 932.63 | docs/s |
| Mean Throughput | custom-vector-bulk | 1133.94 | docs/s |
| Median Throughput | custom-vector-bulk | 1131.69 | docs/s |
| Max Throughput | custom-vector-bulk | 1167.02 | docs/s |
| 50th percentile latency | custom-vector-bulk | 39.0611 | ms |
| 90th percentile latency | custom-vector-bulk | 47.5916 | ms |
| 99th percentile latency | custom-vector-bulk | 83.2863 | ms |
| 99.9th percentile latency | custom-vector-bulk | 177.238 | ms |
| 99.99th percentile latency | custom-vector-bulk | 288.076 | ms |
| 100th percentile latency | custom-vector-bulk | 319.616 | ms |
| error rate | custom-vector-bulk | 0 | % |
Describe the solution you'd like
Just like `_source`, where we can specify which fields are included/excluded or disable it completely, I was thinking of adding the same capability for `_recovery_source`. This would ensure that users can remove specific fields from the recovery source if required.
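One possible shape for this, mirroring the existing `_source` mapping options, is sketched below. The `_recovery_source` mapping key and its placement are only illustrative, not a final API:

```json
PUT /my-vector-index
{
  "mappings": {
    "_source": {
      "excludes": ["my_vector"]
    },
    "_recovery_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```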
Related component
Indexing:Performance
Describe alternatives you've considered
There is currently no alternative way to disable `_recovery_source` for an index.
FAQ
Q1: If a user needs the vector field, how can they retrieve it? Also, the recovery source and `_source` are used for other purposes such as update-by-query, disaster recovery, etc.; how are we going to support those?
In the k-NN repo we are working on a PR which will ensure that the k-NN vector field is read from doc values. Ref: https://github.com/opensearch-project/k-NN/pull/1571. For other fields, similar capabilities can be added in core in an incremental fashion if needed.
Appendix A
Flame graph when `_source`/`_recovery_source` is stored during indexing for vector fields.
Flame graph when `_source` and `_recovery_source` are not stored.
I can work on building this feature.
@msfroh, @reta, @dblock
Updated the flame graphs in the appendix section of the issue.
Thanks @navneet1v for sharing these insights.
- What compression ratio do we see for our benchmark corpus documents in stored fields compared to the raw document data?
- I'm wondering if the inherent problem here is the compression algorithm (looking at the flame graphs). Have you already experimented with disabling compression or trying zstd on stored fields to see what kind of behaviour they show? (One way to test this is sketched below.)
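For reference, a quick way to test the stored-field compression angle could be to recreate the index with a different `index.codec`. This is a static setting, so it has to be supplied at index creation time, and the `zstd`/`zstd_no_dict` values assume an OpenSearch version where those codecs are available (`best_compression` is the built-in alternative); the index name is a placeholder:

```json
PUT /my-vector-index
{
  "settings": {
    "index": {
      "codec": "zstd"
    }
  }
}
```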
`_recovery_source` was introduced to support ops-based recovery through Lucene after soft deletes were enabled, so that we can rely on the Lucene changes snapshot. How will you continue to support this with `_recovery_source` disabled?
> Just like `_source`, where we can specify which fields are included/excluded or disable it completely, I was thinking of adding the same capability for `_recovery_source`. This would ensure that users can remove specific fields from the recovery source if required.
Today, when the original source differs from the source being written (due to including/excluding specific fields), the recovery source is always written using the original source to ensure ops-based recovery. Won't this change diverge from the intent it was serving?
@mgodwan I have answered this question in the FAQ section of the above issue. For the vector field we are working on a PR which will allow the vector field values to be read from doc values and put into the `_source` response. This will ensure that recovery is possible after a crash.
Plus, given that `_recovery_source` gets deleted after a certain point in time, the recovery source was never a foolproof way to recover from crashes.
> Plus, given that `_recovery_source` gets deleted after a certain point in time, the recovery source was never a foolproof way to recover from crashes.
If I understand the feature correctly, I don't think that was ever its use case.
It is used for primary-replica syncing (similar to how the translog had been used in the past). Please evaluate how this will continue to function after a change that disables `_recovery_source` completely, since as of 2.x it is the default way of syncing replicas with the primary.
> Please evaluate how this will continue to function after a change that disables `_recovery_source` completely, since as of 2.x it is the default way of syncing replicas with the primary.
Is there a way I can check this? Any integration test or anything else that can help me here?
@mgodwan