
[BUG] Segment replication failure

Open disaster37 opened this issue 1 year ago • 2 comments

Describe the bug

After enabling segment replication, I get error logs showing that segment replication has failed. It seems to be linked to the ISM policy, when the force_merge action runs to leave only one segment per shard.

The logs:

{"type": "json_logger", "timestamp": "2024-09-30T18:58:29,990Z", "level": "ERROR", "component": "o.o.i.r.SegmentReplicationTargetService", "cluster.name": "logmanagement2-prd", "node.name": "opensearch-hot2-os-1", "message": "[shardId [.ds-logs-log-default-000016][0]] [replication id 220112] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=0, FILE_DIFF=31, REPLICATING=0, GET_FILES=0}", "cluster.uuid": "ZVJvcMA3TmK00qqKawBTjg", "node.id": "SwXDwI_ZSQWvhLuUisEhWA" , 
"stacktrace": ["org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed",
"at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:602) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) [?:?]",
"at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) [opensearch-core-2.16.0.jar:2.16.0]",
"at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:404) [opensearch-security-2.16.0.0.jar:2.16.0.0]",
"at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1505) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:427) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$3(NativeMessageHandler.java:421) [opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:882) [opensearch-2.16.0.jar:2.16.0]",
"at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
"at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
"at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]",
"Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: java.nio.file.NoSuchFileException: /usr/share/opensearch/data/nodes/0/indices/bpdiNAx7SE65L0gmSvzrHg/0/index/_1a93.si",
"at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:326) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.16.0.jar:2.16.0]",
"... 20 more",
"Caused by: java.nio.file.NoSuchFileException: /usr/share/opensearch/data/nodes/0/indices/bpdiNAx7SE65L0gmSvzrHg/0/index/_1a93.si",
"at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
"at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
"at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
"at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224) ~[?:?]",
"at java.base/java.nio.channels.FileChannel.open(FileChannel.java:309) ~[?:?]",
"at java.base/java.nio.channels.FileChannel.open(FileChannel.java:369) ~[?:?]",
"at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:183) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.codecs.lucene99.Lucene99SegmentInfoFormat.read(Lucene99SegmentInfoFormat.java:94) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:411) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:368) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:315) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"at org.opensearch.index.store.Store.buildSegmentInfos(Store.java:820) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:295) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194) ~[opensearch-2.16.0.jar:2.16.0]",
"at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.16.0.jar:2.16.0]",
"... 20 more",
"Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (d0caa28b). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(Snapshot of SegmentInfos))",
"\tat org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:501) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"\tat org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:375) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"\tat org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:315) ~[lucene-core-9.11.1.jar:9.11.1 0c087dfdd10e0f6f3f6faecc6af4415e671a9e69 - 2024-06-23 12:31:02]",
"\tat org.opensearch.index.store.Store.buildSegmentInfos(Store.java:820) ~[opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:295) ~[opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194) ~[opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1596) [?:?]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) [opensearch-core-2.16.0.jar:2.16.0]",
"\tat org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:404) [opensearch-security-2.16.0.0.jar:2.16.0.0]",
"\tat org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1505) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:427) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$3(NativeMessageHandler.java:421) [opensearch-2.16.0.jar:2.16.0]",
"\tat org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:882) [opensearch-2.16.0.jar:2.16.0]",
"\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]",
"\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]",
"\tat java.base/java.lang.Thread.run(Thread.java:1583) [?:?]"] }

Related component

Indexing:Replication

To Reproduce

  1. Deploy an OpenSearch 2.16.0 cluster with 3 nodes
  2. Enable segment replication on the cluster (cluster.indices.replication.strategy: 'SEGMENT'). I have also enabled replication back pressure (segrep.pressure.enabled)
  3. Create an ISM policy that rolls the index over once the primary shard reaches 50 GB, then runs a force_merge action down to 1 segment after the rollover (a sketch of this configuration is shown after these steps)
  4. Create a data stream index that the ISM policy applies to
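
For reference, a minimal sketch of the configuration described in steps 2 and 3 (the policy ID, state names, and index pattern here are placeholders; the actual policy in use may differ):

PUT _cluster/settings
{
  "persistent": {
    "cluster.indices.replication.strategy": "SEGMENT",
    "segrep.pressure.enabled": true
  }
}

PUT _plugins/_ism/policies/rollover-then-merge
{
  "policy": {
    "description": "Roll over at 50 GB primary shard size, then force merge down to one segment",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_primary_shard_size": "50gb" } }
        ],
        "transitions": [
          { "state_name": "merged" }
        ]
      },
      {
        "name": "merged",
        "actions": [
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["logs-log-*"], "priority": 100 }
    ]
  }
}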

When the force_merge action runs, you get the logs shown above. Moreover, the force_merge action ends up timing out because the replica shard never gets down to one segment (I checked that with GET _cat/segments/.ds-logs-log-default-000016?v).
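
For anyone reproducing this, a sketch of the check mentioned above, comparing segment counts on the primary and replica copies (the column selection via h= is optional):

GET _cat/segments/.ds-logs-log-default-000016?v&h=index,shard,prirep,segment,size

A healthy setup should show the replica converging to a single segment shortly after the primary does; here, per the report, the replica never does.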

Expected behavior

Segment replication to the replica works when force_merge runs.

Additional Details

Plugins: Please list all plugins currently enabled.

Screenshots: If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context: Add any other context about the problem here.

disaster37 avatar Oct 01 '24 07:10 disaster37

Moving to storage remote

[Triage attendees - 1 2 3 4]

bharath-techie avatar Oct 14 '24 15:10 bharath-techie

[Storage Triage] - Attendees 1 2 3 4

Thanks @disaster37 for filing this issue

  1. Are you observing this issue repeatedly?
  2. Can you please check your logs to see whether segment replication was already failing before the force merge as well? You can also use the CAT segment replication API to find the replication lag.

gbbafna avatar Oct 17 '24 15:10 gbbafna
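
For reference, the check suggested above would look roughly like this, using the index name from the report (adjust the index as needed):

GET _cat/segment_replication?v
GET _cat/segment_replication/.ds-logs-log-default-000016?v

Columns such as checkpoints_behind and current_lag in the response (if present in this version) indicate whether the replica was already falling behind before the force merge.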