[RFC] Introduce concurrent translog recovery to accelerate segment replication primary promotion
Background
The purpose of this RFC is to discuss the solution to the long primary promotion time for segment replication mentioned in https://github.com/opensearch-project/OpenSearch/issues/20118. Primary promotion may take up to 15 minutes or even longer. During this period, write throughput significantly decreases.
Reproduction
Primary promotion takes a long time
- The test cluster includes
4 16U64Gdata nodes, with segment replication and merged segment warmer enabled. Execute the following command.
opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:235::33]:9463 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:4, number_of_replicas: 1" --kill-running-processes
- After restarting the data-0 node, a write throughput decline lasting more than
15minutes can be observed.
- Through the logs, we can see that a total of
3,911,465operations were recovered from the translog.
[2025-11-22T01:29:43,325][TRACE][o.o.i.e.Engine ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3911465], current translog generation [1693]
Analysis
Reasons for the long translog recovery time
In the segment replication scenario, replicas only record the translog. The default refresh interval for the nyc_taxis is 30 seconds. In a production environment, a refresh interval of 30 to 60 seconds is also a common configuration. This means that any document that has not been synchronized through segment replication within the refresh interval will be recovered via the translog during the primary promotion process.
Solution
Accelerate translog recovery
- Add a dedicated thread pool
translog_recoveryfor translog recovery. - Provides dynamic configuration at the index level to enable concurrent recovery of translog and specify the number of translog operations processed by a single thread.
- When performing translog recovery, if the total number of translogs exceeds the threshold for single-threaded processing, it will be split into multiple threads for concurrent execution.
- We can consider first restricting the concurrency mechanism of translog recovery to the segment replication scenario. Meanwhile, some backoff strategies can be considered, such as falling back to the current logic when the
translog_recoverythread pool is busy.
Evaluation
Recovery speed increased by 13 times
Using the same test, enable translog recovery concurrency, with each thread handling 200,000 operations.
- After restarting the data-0 node, the write throughput decline lasted
1min7s
- Through the logs, we can see that a total of
3,804,928operations were recovered from the translog.
[2025-11-27T14:45:17,623][TRACE][o.o.i.e.Engine ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3804928], current translog generation [1285]
- During this period, the newly introduced translog recovery thread can be observed working.
Related component
Indexing:Replication
Describe alternatives you've considered
No response
Additional context
No response
In order to evaluate the resource overhead of translog concurrent recovery, I conducted several sets of tests.
Method
The testing method is to restart the node where the primary shard is located after running the benchmark for a period of time. By varying the number of operations processed by a single thread, I observed the time taken for translog recovery and CPU utilization.
Environment
4 16U64Gdata nodes- nyc_taxis,
2primary shard and1replica
Command
opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:217::52]:9330 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:2, number_of_replicas: 1" --kill-running-processes
Result
Default concurrency strategy
- Translog concurrent recovery is restricted to use in the primary promotion scenario of segment replication.
- Independent
translog_recoverythread pool, with a size equal to the number of cores, configured at the node level. - The switch for enabling translog concurrent recovery can be dynamically adjusted through cluster-level configuration.
- Single-threaded processing defaults to handling
500,000operations and can be dynamically adjusted through cluster-level configuration. - When the number of translog recovery operations is less than the single-threaded threshold, concurrency is not enabled, remaining consistent with the current execution logic.
- When the number of translog recovery operations exceeds the single-thread threshold.
- If
translog_recoveryis not busy, enable concurrency, with each thread responsible for executing the number of operations specified by the single-thread threshold. - If
translog_recoveryis busy, do not enable concurrency, keeping it consistent with the current execution logic.
- If
@Bukhtawar @sachinpkale @gbbafna @mch2 @ashking94 @shourya035 @linuxpi - Do take a look and give your feedback:)
@bugmakerrrrrr I submitted a PR and would like to invite you to help review it, thank you :)