OpenSearch [RFC] Introduce concurrent translog recovery to accelerate segment replication primary promotion

Background

The purpose of this RFC is to discuss the solution to the long primary promotion time for segment replication mentioned in https://github.com/opensearch-project/OpenSearch/issues/20118. Primary promotion may take up to 15 minutes or even longer. During this period, write throughput significantly decreases.

Reproduction

Primary promotion takes a long time

The test cluster includes 4 16U64G data nodes, with segment replication and merged segment warmer enabled. Execute the following command.

opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:235::33]:9463 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:4, number_of_replicas: 1" --kill-running-processes

After restarting the data-0 node, a write throughput decline lasting more than 15 minutes can be observed.

Through the logs, we can see that a total of 3,911,465 operations were recovered from the translog.

[2025-11-22T01:29:43,325][TRACE][o.o.i.e.Engine           ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3911465], current translog generation [1693]

Analysis

Reasons for the long translog recovery time

In the segment replication scenario, replicas only record the translog. The default refresh interval for the nyc_taxis is 30 seconds. In a production environment, a refresh interval of 30 to 60 seconds is also a common configuration. This means that any document that has not been synchronized through segment replication within the refresh interval will be recovered via the translog during the primary promotion process.

Solution

Accelerate translog recovery

Add a dedicated thread pool translog_recovery for translog recovery.
Provides dynamic configuration at the index level to enable concurrent recovery of translog and specify the number of translog operations processed by a single thread.
When performing translog recovery, if the total number of translogs exceeds the threshold for single-threaded processing, it will be split into multiple threads for concurrent execution.
We can consider first restricting the concurrency mechanism of translog recovery to the segment replication scenario. Meanwhile, some backoff strategies can be considered, such as falling back to the current logic when the translog_recovery thread pool is busy.

Evaluation

Recovery speed increased by 13 times

Using the same test, enable translog recovery concurrency, with each thread handling 200,000 operations.

After restarting the data-0 node, the write throughput decline lasted 1min7s

Through the logs, we can see that a total of 3,804,928 operations were recovered from the translog.

[2025-11-27T14:45:17,623][TRACE][o.o.i.e.Engine           ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3804928], current translog generation [1285]

During this period, the newly introduced translog recovery thread can be observed working.

Related component

Indexing:Replication

Describe alternatives you've considered

No response

Additional context

No response

Dec 01 '25 03:12 guojialiang92

In order to evaluate the resource overhead of translog concurrent recovery, I conducted several sets of tests.

Method

The testing method is to restart the node where the primary shard is located after running the benchmark for a period of time. By varying the number of operations processed by a single thread, I observed the time taken for translog recovery and CPU utilization.

Environment

4 16U64G data nodes
nyc_taxis, 2 primary shard and 1 replica

Command

opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:217::52]:9330 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:2, number_of_replicas: 1" --kill-running-processes

Result

Default concurrency strategy

Translog concurrent recovery is restricted to use in the primary promotion scenario of segment replication.
Independent translog_recovery thread pool, with a size equal to the number of cores, configured at the node level.
The switch for enabling translog concurrent recovery can be dynamically adjusted through cluster-level configuration.
Single-threaded processing defaults to handling 500,000 operations and can be dynamically adjusted through cluster-level configuration.
When the number of translog recovery operations is less than the single-threaded threshold, concurrency is not enabled, remaining consistent with the current execution logic.
When the number of translog recovery operations exceeds the single-thread threshold.
- If translog_recovery is not busy, enable concurrency, with each thread responsible for executing the number of operations specified by the single-thread threshold.
- If translog_recovery is busy, do not enable concurrency, keeping it consistent with the current execution logic.

Dec 04 '25 09:12 guojialiang92

@Bukhtawar @sachinpkale @gbbafna @mch2 @ashking94 @shourya035 @linuxpi - Do take a look and give your feedback:)

Dec 04 '25 09:12 guojialiang92

@bugmakerrrrrr I submitted a PR and would like to invite you to help review it, thank you :)

Dec 17 '25 07:12 guojialiang92