[BUG] Cross cluster replication fails to allocate shard on follower cluster
What is the bug? Replication does not start. Shard fails to allocate.
How can one reproduce the bug? Steps to reproduce the behavior:
- Start replication:

      PUT _plugins/_replication/proxy-2024.11/_start
      {
        "leader_alias": "main-cluster",
        "leader_index": "proxy-2024.11",
        "use_roles": {
          "leader_cluster_role": "all_access",
          "follower_cluster_role": "all_access"
        }
      }
- See the error:

      GET _plugins/_replication/proxy-2024.11/_status
      {
        "status": "FAILED",
        "reason": "",
        "leader_alias": "prod-mon-elk-muc",
        "leader_index": "proxy-2024.11",
        "follower_index": "proxy-2024.11"
      }
What is the expected behavior? Replication should start
What is your host/environment?
- OS: Ubuntu 22
- Opensearch 2.18 docker image
- Plugins: just what is part of opensearch docker image
Do you have any screenshots? N/A
Do you have any additional context? The OpenSearch logs show lots of Java stack traces, but they all end with java.lang.IllegalStateException: confined.
Example stack trace:

    [2024-11-25T16:51:05,128][ERROR][o.o.r.r.RemoteClusterRepository] [opensearch-node-114] Restore of [proxy-2024.11][0] failed due to java.lang.IllegalStateException: confined
        at org.apache.lucene.store.MemorySegmentIndexInput.ensureAccessible(MemorySegmentIndexInput.java:103)
        at org.apache.lucene.store.MemorySegmentIndexInput.buildSlice(MemorySegmentIndexInput.java:461)
        at org.apache.lucene.store.MemorySegmentIndexInput.clone(MemorySegmentIndexInput.java:425)
        at org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl.clone(MemorySegmentIndexInput.java:530)
        at org.opensearch.replication.repository.RestoreContext.openInput(RestoreContext.kt:39)
        at org.opensearch.replication.repository.RemoteClusterRestoreLeaderService.openInputStream(RemoteClusterRestoreLeaderService.kt:76)
        at org.opensearch.replication.action.repository.TransportGetFileChunkAction$shardOperation$1.invoke(TransportGetFileChunkAction.kt:59)
        at org.opensearch.replication.action.repository.TransportGetFileChunkAction$shardOperation$1.invoke(TransportGetFileChunkAction.kt:57)
        at org.opensearch.replication.util.ExtensionsKt.performOp(Extensions.kt:55)
        at org.opensearch.replication.util.ExtensionsKt.performOp$default(Extensions.kt:52)
        at org.opensearch.replication.action.repository.TransportGetFileChunkAction.shardOperation(TransportGetFileChunkAction.kt:57)
        at org.opensearch.replication.action.repository.TransportGetFileChunkAction.shardOperation(TransportGetFileChunkAction.kt:33)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.lang.Thread.run(Thread.java:1583)
Followed by:

    [2024-11-25T16:51:05,135][ERROR][o.o.r.r.RemoteClusterRepository] [opensearch-node-114] Releasing leader resource failed due to NotSerializableExceptionWrapper[wrong_thread_exception: Attempted access outside owning thread]
        at jdk.internal.foreign.MemorySessionImpl.wrongThread(MemorySessionImpl.java:315)
        at jdk.internal.misc.ScopedMemoryAccess$ScopedAccessError.newRuntimeException(ScopedMemoryAccess.java:113)
        at jdk.internal.foreign.MemorySessionImpl.checkValidState(MemorySessionImpl.java:219)
        at jdk.internal.foreign.ConfinedSession.justClose(ConfinedSession.java:83)
        at jdk.internal.foreign.MemorySessionImpl.close(MemorySessionImpl.java:242)
        at jdk.internal.foreign.MemorySessionImpl$1.close(MemorySessionImpl.java:88)
        at org.apache.lucene.store.MemorySegmentIndexInput.close(MemorySegmentIndexInput.java:514)
        at org.opensearch.replication.repository.RestoreContext.close(RestoreContext.kt:52)
        at org.opensearch.replication.repository.RemoteClusterRestoreLeaderService.removeLeaderClusterRestore(RemoteClusterRestoreLeaderService.kt:142)
        at org.opensearch.replication.action.repository.TransportReleaseLeaderResourcesAction.shardOperation(TransportReleaseLeaderResourcesAction.kt:48)
        at org.opensearch.replication.action.repository.TransportReleaseLeaderResourcesAction.shardOperation(TransportReleaseLeaderResourcesAction.kt:31)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.lang.Thread.run(Thread.java:1583)
Setting: "plugins.replication.follower.index.recovery.chunk_size": "1gb", "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"
Seems to fix the issue. It appears that if the files are too large on the primary cluster, replication fails to start unless recovery.chunk_size is big enough to transfer each file in one go.
It appears that by setting 'plugins.replication.follower.index.recovery.chunk_size' to the maximum (which is 1gb), I can get all but one index to replicate. Would it be possible to raise the limit above 1gb? Something strange seems to happen when a file has to be transferred in multiple chunks: any chunk with offset > 0 fails to transfer with 'java.lang.IllegalStateException: confined'.
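For reference, a sketch of how the workaround above can be applied as persistent cluster settings on the follower cluster (assuming these are exposed as dynamic cluster settings by the replication plugin in your version):

```
PUT _cluster/settings
{
  "persistent": {
    "plugins.replication.follower.index.recovery.chunk_size": "1gb",
    "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"
  }
}
```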
I am experiencing the same problem. After upgrading to OpenSearch 2.18.0 from 2.12.0, I am unable to create the replication. My index has a 10GB shard, and it cannot be transferred.
https://github.com/opensearch-project/cross-cluster-replication/issues/1482
There seems to be a similar issue. When setting up CCR on an index that already contains data, an error occurs, and the data fails to transfer.
https://github.com/opensearch-project/cross-cluster-replication/issues/1465#issuecomment-2499956562
Yes, that's right.
I agree, and it seems to work fine when the chunk size is set sufficiently large. However, this does not appear to be a fundamental solution, and internal improvements seem to be necessary.
@dblock It seems like an important issue. Could you take a look?
@krisfreedain I’m wondering if the CCR team is aware of this issue. It seems like a pretty critical bug, but no one seems to be paying attention to it.
Setting: "plugins.replication.follower.index.recovery.chunk_size": "1gb", "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"
Seems to fix the issue. It appears that if the files are too large on the primary cluster, replication fails to start unless recovery.chunk_size is big enough to transfer files in one go.
This solved my problem for now, I'll try with bigger indices. The ones that got stuck were only ~90MB and this was a test cluster setup to test CCR specifically. I want to replace a prod ES cluster with several 500GB indices. Wondering if this will work 🤔
Not sure if it's related to this change: https://github.com/apache/lucene/commit/c8e05c8cd6e018133d3d739d159d5b3b6b29c179#diff-be8459728adce212557bc19afbe0a8b91c788a49b1379323af40e3ba3ec4e479R154
Can you retry after disabling this feature by adding this JVM arg: "-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
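For the Docker image mentioned in the report, one way to pass such a flag is through OPENSEARCH_JAVA_OPTS, e.g. in docker-compose (a sketch only; the image tag and heap sizes are placeholders, and whether the flag still has any effect depends on the bundled Lucene version):

```yaml
services:
  opensearch-follower:
    image: opensearchproject/opensearch:2.18.0
    environment:
      # Placeholder heap settings plus the suggested Lucene flag.
      - "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
```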
Might be related to https://github.com/opensearch-project/OpenSearch/pull/17502
We can make a similar change here: https://github.com/opensearch-project/cross-cluster-replication/blob/main/src/main/kotlin/org/opensearch/replication/repository/RestoreContext.kt#L42
> Might be related to https://github.com/opensearch-project/OpenSearch/pull/17502
You're right @ankitkala
This was introduced by https://github.com/apache/lucene/pull/13535, where MemorySegmentIndexInput switched to confined Arenas for IOContext.READONCE cases. Apparently, that assumption breaks down when multiple chunks (and hence, potentially, multiple threads) are involved.
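To illustrate the confinement rule at the JDK level, here is a minimal standalone Kotlin sketch (not plugin code; it assumes a JDK where java.lang.foreign is final, e.g. JDK 22+): a segment backed by a confined arena may only be accessed by its owning thread, which is exactly what breaks when a second chunk request touches the same IndexInput.

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.ValueLayout
import kotlin.concurrent.thread

fun main() {
    Arena.ofConfined().use { arena ->
        val segment = arena.allocate(16L)

        // The owning thread can read and write freely.
        segment.set(ValueLayout.JAVA_BYTE, 0L, 42.toByte())

        // Any other thread is rejected by the confinement check; this is the
        // same class of failure seen in the TransportGetFileChunkAction stack trace.
        thread {
            try {
                segment.get(ValueLayout.JAVA_BYTE, 0L)
            } catch (e: RuntimeException) {
                println("cross-thread access rejected: $e") // WrongThreadException
            }
        }.join()
    }
}
```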
Were you able to mitigate this by using the JVM argument shared above? I'll check if this can be fixed in upcoming versions.
If you're referring to -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false, then I don't think that setting exists anymore, as it was removed by https://github.com/apache/lucene/pull/13146, so the only way to fix this is to change the code.
I was able to reproduce the problem locally by hacking some small values into the REPLICATION_FOLLOWER_RECOVERY_CHUNK_SIZE setting and running BasicReplicationIT.
https://github.com/opensearch-project/cross-cluster-replication/blob/4d04eb5025649f1337a907cf9049ac999ed7b759/src/main/kotlin/org/opensearch/replication/repository/RestoreContext.kt#L42
    val input = directory.openInput(file.name, IOContext.READ) // instead of READONCE
The easiest solution, despite potential performance issues and additional GC overhead, is to change the IOContext mode to READ, but I don't think it is the best way.
I can confirm that using IOContext.READ fixes the problem for me (it must be installed on both sides; I had some failed tries before realizing that).
@ankitkala https://github.com/opensearch-project/cross-cluster-replication/issues/1465#issuecomment-2777691141
How about this solution?
Setting IOContext.READ seems like the right way forward.
For historical reference, here is a more detailed explanation of the issue:
It appears that Apache Lucene has recently updated its code to leverage the Java Project Panama API to access native memory more efficiently. Project Panama provides a safer and more efficient way to interact with native code (e.g., C libraries) from the JVM. Its main goal is to improve upon the complexity and potential risks associated with traditional JNI (Java Native Interface).
Reference: https://openjdk.org/projects/panama/
While reviewing the internal code of Apache Lucene:
https://github.com/apache/lucene/blob/88d3573d3b459f8d980898837b403c89ba8b5986/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java#L61
we can see that when the IOContext.READONCE option is used, Arena.ofConfined() is specified.
Segments created using Arena.ofConfined() can only be accessed by a single thread.
see : https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/lang/foreign/Arena.html
> Returns a new confined arena, owned by the current thread.
However, our current CCR architecture is multi-threaded, and that is what caused the issue.
We have now modified the implementation to use Arena.ofShared() instead, which allows the segment to be accessed by multiple threads. This can be easily controlled via the IOContext.DEFAULT option.
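For comparison, a minimal sketch of the shared-arena behaviour that a non-READONCE context (such as IOContext.DEFAULT) effectively opts into; again, this is standalone JDK 22+ code, not the plugin itself:

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.ValueLayout
import kotlin.concurrent.thread

fun main() {
    Arena.ofShared().use { arena ->
        val segment = arena.allocate(16L)
        segment.set(ValueLayout.JAVA_BYTE, 0L, 42.toByte())

        // A shared arena is not bound to the creating thread, so concurrent
        // readers (like CCR's file-chunk transport threads) are allowed.
        thread {
            println(segment.get(ValueLayout.JAVA_BYTE, 0L)) // prints 42
        }.join()
    }
}
```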
In the long term, it would be worth testing whether the Arena.ofConfined() approach actually performs worse than the multi-threaded model. Using Arena.ofConfined() may reduce the need for explicit lock management and unnecessary buffer allocations, potentially lowering GC overhead.
cc @ankitkala
This is happening to us on AWS-managed OpenSearch 2.19 clusters with quite small indices (4.2gb primaries, 10,423,694 docs). @ankitkala, what's the lowest OpenSearch version where this is fixed?
@luisfavila Hi, the issue has been fixed in version 2.19.2 and later. Could you try using version 2.19.2 or higher?
@10000-ki Thank you! Trying to get AWS support to upgrade us to that - they don't let us select minor versions (only 2.17, 2.18, 2.19), with the latest being 2.19, which we're already on and where the bug is still present, so no idea.
Edit: AWS said they won't help unless we subscribe to Business support, ahah. Stay away from managed solutions, people.
Shouldn't the PR be listed in the release notes? https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.19.2.md
@luisfavila See https://github.com/opensearch-project/cross-cluster-replication/commits/2.19/
If you can't choose the patch version, you may have to use OpenSearch v3.0.0.