
[BUG] Cross cluster replication fails to allocate shard on follower cluster

Open borutlukic opened this issue 1 year ago • 7 comments

What is the bug? Replication does not start. Shard fails to allocate.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Start replication:

     PUT _plugins/_replication/proxy-2024.11/_start
     {
       "leader_alias": "main-cluster",
       "leader_index": "proxy-2024.11",
       "use_roles": {
         "leader_cluster_role": "all_access",
         "follower_cluster_role": "all_access"
       }
     }

  2. Check the replication status and see the error:

     GET _plugins/_replication/proxy-2024.11/_status
     {
       "status": "FAILED",
       "reason": "",
       "leader_alias": "prod-mon-elk-muc",
       "leader_index": "proxy-2024.11",
       "follower_index": "proxy-2024.11"
     }

What is the expected behavior? Replication should start

What is your host/environment?

  • OS: Ubuntu 22
  • OpenSearch 2.18 Docker image
  • Plugins: just what is part of the OpenSearch Docker image

Do you have any screenshots? N/A

Do you have any additional context? OpenSearch logs show lots of Java stack traces, but they all end with java.lang.IllegalStateException: confined.

Example stack log:

[2024-11-25T16:51:05,128][ERROR][o.o.r.r.RemoteClusterRepository] [opensearch-node-114] Restore of [proxy-2024.11][0] failed due to
java.lang.IllegalStateException: confined
    at org.apache.lucene.store.MemorySegmentIndexInput.ensureAccessible(MemorySegmentIndexInput.java:103)
    at org.apache.lucene.store.MemorySegmentIndexInput.buildSlice(MemorySegmentIndexInput.java:461)
    at org.apache.lucene.store.MemorySegmentIndexInput.clone(MemorySegmentIndexInput.java:425)
    at org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl.clone(MemorySegmentIndexInput.java:530)
    at org.opensearch.replication.repository.RestoreContext.openInput(RestoreContext.kt:39)
    at org.opensearch.replication.repository.RemoteClusterRestoreLeaderService.openInputStream(RemoteClusterRestoreLeaderService.kt:76)
    at org.opensearch.replication.action.repository.TransportGetFileChunkAction$shardOperation$1.invoke(TransportGetFileChunkAction.kt:59)
    at org.opensearch.replication.action.repository.TransportGetFileChunkAction$shardOperation$1.invoke(TransportGetFileChunkAction.kt:57)
    at org.opensearch.replication.util.ExtensionsKt.performOp(Extensions.kt:55)
    at org.opensearch.replication.util.ExtensionsKt.performOp$default(Extensions.kt:52)
    at org.opensearch.replication.action.repository.TransportGetFileChunkAction.shardOperation(TransportGetFileChunkAction.kt:57)
    at org.opensearch.replication.action.repository.TransportGetFileChunkAction.shardOperation(TransportGetFileChunkAction.kt:33)
    at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131)
    at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
    at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1583)

Followed by:

[2024-11-25T16:51:05,135][ERROR][o.o.r.r.RemoteClusterRepository] [opensearch-node-114] Releasing leader resource failed due to
NotSerializableExceptionWrapper[wrong_thread_exception: Attempted access outside owning thread]
    at jdk.internal.foreign.MemorySessionImpl.wrongThread(MemorySessionImpl.java:315)
    at jdk.internal.misc.ScopedMemoryAccess$ScopedAccessError.newRuntimeException(ScopedMemoryAccess.java:113)
    at jdk.internal.foreign.MemorySessionImpl.checkValidState(MemorySessionImpl.java:219)
    at jdk.internal.foreign.ConfinedSession.justClose(ConfinedSession.java:83)
    at jdk.internal.foreign.MemorySessionImpl.close(MemorySessionImpl.java:242)
    at jdk.internal.foreign.MemorySessionImpl$1.close(MemorySessionImpl.java:88)
    at org.apache.lucene.store.MemorySegmentIndexInput.close(MemorySegmentIndexInput.java:514)
    at org.opensearch.replication.repository.RestoreContext.close(RestoreContext.kt:52)
    at org.opensearch.replication.repository.RemoteClusterRestoreLeaderService.removeLeaderClusterRestore(RemoteClusterRestoreLeaderService.kt:142)
    at org.opensearch.replication.action.repository.TransportReleaseLeaderResourcesAction.shardOperation(TransportReleaseLeaderResourcesAction.kt:48)
    at org.opensearch.replication.action.repository.TransportReleaseLeaderResourcesAction.shardOperation(TransportReleaseLeaderResourcesAction.kt:31)
    at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131)
    at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
    at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1583)

borutlukic avatar Nov 25 '24 16:11 borutlukic

Setting:

"plugins.replication.follower.index.recovery.chunk_size": "1gb",
"plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"

Seems to fix the issue. It appears that if the files on the leader (primary) cluster are too large, replication fails to start unless recovery.chunk_size is large enough to transfer each file in one go.
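For reference, this is roughly how those settings can be applied on the follower cluster via the cluster settings API (a sketch; adjust the values to your environment):

PUT _cluster/settings
{
  "persistent": {
    "plugins.replication.follower.index.recovery.chunk_size": "1gb",
    "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"
  }
}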

borutlukic avatar Nov 26 '24 08:11 borutlukic

It appears that by setting 'plugins.replication.follower.index.recovery.chunk_size' to the maximum (which is 1gb) I can get all but one index to replicate. Would it be possible to raise the limit above 1gb? Something strange seems to happen when files have to be transferred in multiple chunks: any chunk with offset > 0 fails with 'java.lang.IllegalStateException: confined'.

borutlukic avatar Nov 28 '24 08:11 borutlukic

[Catch All Triage - 1, 2, 3, 4]

dblock avatar Jan 06 '25 17:01 dblock

I am experiencing the same problem. After upgrading to OpenSearch 2.18.0 from 2.12.0, I am unable to create the replication. My index has a 10GB shard, and it cannot be transferred.

Lavisx avatar Jan 09 '25 11:01 Lavisx

https://github.com/opensearch-project/cross-cluster-replication/issues/1482

There seems to be a similar issue. When setting up CCR on an index that already contains data, an error occurs, and the data fails to transfer.

10000-ki avatar Feb 04 '25 05:02 10000-ki

https://github.com/opensearch-project/cross-cluster-replication/issues/1465#issuecomment-2499956562

Yes, that's right.

I agree, and it seems to work fine when the chunk size is set sufficiently large. However, this does not appear to be a fundamental solution, and internal improvements seem to be necessary.

@dblock This seems like an important issue. Could you take a look?

10000-ki avatar Feb 04 '25 06:02 10000-ki

@krisfreedain I’m wondering if the CCR team is aware of this issue. It seems like a pretty critical bug, but no one seems to be paying attention to it.

10000-ki avatar Mar 15 '25 07:03 10000-ki

Setting: "plugins.replication.follower.index.recovery.chunk_size": "1gb", "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"

Seems to fix the issue. It appears that if the files are too large on the primary cluster, replication fails to start unless recovery.chunk_size is big enough to transfer files in one go.

This solved my problem for now; I'll try with bigger indices. The ones that got stuck were only ~90MB, and this was a test cluster set up specifically to test CCR. I want to replace a prod ES cluster that has several 500GB indices. Wondering if this will work 🤔

sGoico avatar Apr 01 '25 16:04 sGoico

Not sure if it's related to this change: https://github.com/apache/lucene/commit/c8e05c8cd6e018133d3d739d159d5b3b6b29c179#diff-be8459728adce212557bc19afbe0a8b91c788a49b1379323af40e3ba3ec4e479R154

Can you retry after disabling this feature by adding this JVM arg: "-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
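For example, with the official Docker image the flag can probably be passed through the OPENSEARCH_JAVA_OPTS environment variable (or appended to config/jvm.options); a sketch:

docker run -e "OPENSEARCH_JAVA_OPTS=-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false" opensearchproject/opensearch:2.18.0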

ankitkala avatar Apr 02 '25 10:04 ankitkala

Might be related to https://github.com/opensearch-project/OpenSearch/pull/17502

We can do similar change here: https://github.com/opensearch-project/cross-cluster-replication/blob/main/src/main/kotlin/org/opensearch/replication/repository/RestoreContext.kt#L42

ankitkala avatar Apr 02 '25 13:04 ankitkala

Might be related to https://github.com/opensearch-project/OpenSearch/pull/17502

You're right @ankitkala

This was introduced by https://github.com/apache/lucene/pull/13535, where MemorySegmentIndexInput switched to confined Arenas for IOContext.READONCE cases. That assumption breaks down when multiple chunks (and hence, potentially, multiple threads) are involved.
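In other words, the arena selection inside Lucene's MemorySegmentIndexInputProvider boils down to roughly the following (a paraphrase in Kotlin, not the actual Lucene source):

import java.lang.foreign.Arena
import org.apache.lucene.store.IOContext

// Roughly: READONCE inputs get a confined arena (usable from one thread only),
// while everything else gets a shared arena.
fun arenaFor(context: IOContext): Arena =
    if (context == IOContext.READONCE) Arena.ofConfined() else Arena.ofShared()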

mishail avatar Apr 03 '25 23:04 mishail

Were you able to mitigate this by using the JVM argument shared above? I'll check if this can be fixed for upcoming versions.

ankitkala avatar Apr 04 '25 00:04 ankitkala

If you're referring to -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false, then I don't think that setting exists anymore; it was removed by https://github.com/apache/lucene/pull/13146, so the only way to fix this is to change the code.

I was able to reproduce the problem locally by hacking some small values into the REPLICATION_FOLLOWER_RECOVERY_CHUNK_SIZE setting and running BasicReplicationIT.
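In case anyone wants to reproduce it the same way, something along these lines should work (assuming the repo's standard Gradle integration test task; the exact task name and test filter may differ):

./gradlew integTest --tests "*BasicReplicationIT"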

mishail avatar Apr 04 '25 01:04 mishail

https://github.com/opensearch-project/cross-cluster-replication/blob/4d04eb5025649f1337a907cf9049ac999ed7b759/src/main/kotlin/org/opensearch/replication/repository/RestoreContext.kt#L42

val input = directory.openInput(file.name, IOContext.READ)  // instead of READONCE

The easiest solution, despite potential performance issues and additional GC overhead, is to change the IOContext mode to READ.

But I don't think that is the best way to fix it.

10000-ki avatar Apr 04 '25 06:04 10000-ki

I can confirm that using IOContext.READ fixes the problem for me (it must be installed on both sides; I had some failed tries before realizing that...).

grmblfrz avatar Apr 04 '25 15:04 grmblfrz

@ankitkala https://github.com/opensearch-project/cross-cluster-replication/issues/1465#issuecomment-2777691141

What about this solution?

10000-ki avatar Apr 07 '25 02:04 10000-ki

Setting IOContext.READ seems like the right way forward.

ankitkala avatar Apr 07 '25 04:04 ankitkala

For historical reference, here is a more detailed explanation of the issue:

It appears that Apache Lucene has recently updated its code to leverage the Java Project Panama API to access native memory more efficiently. Project Panama provides a safer and more efficient way to interact with native code (e.g., C libraries) from the JVM. Its main goal is to improve upon the complexity and potential risks associated with traditional JNI (Java Native Interface).

Reference: https://openjdk.org/projects/panama/

While reviewing the internal code of Apache Lucene:

https://github.com/apache/lucene/blob/88d3573d3b459f8d980898837b403c89ba8b5986/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java#L61

We can see that when the IOContext.READONCE option is used, Arena.ofConfined() is specified.

Segments created using Arena.ofConfined() can only be accessed by a single thread. See: https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/lang/foreign/Arena.html

Returns a new confined arena, owned by the current thread.

However, our current CCR architecture is multi-threaded, which is what caused the issue here.

We have now modified the implementation to use Arena.ofShared() instead, which allows the segment to be accessed by multiple threads. This can be easily controlled via the IOContext.DEFAULT option.
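As a standalone illustration of the difference (just a sketch, not plugin code; needs Java 21+):

import java.lang.foreign.Arena
import java.lang.foreign.ValueLayout
import kotlin.concurrent.thread

fun main() {
    // Confined arena: only the owning (creating) thread may access the segment.
    Arena.ofConfined().use { arena ->
        val segment = arena.allocate(16L)
        segment.set(ValueLayout.JAVA_BYTE, 0L, 42.toByte())        // OK: owner thread
        thread {
            runCatching { segment.get(ValueLayout.JAVA_BYTE, 0L) } // WrongThreadException,
                .onFailure { println("other thread failed: $it") } // like the errors above
        }.join()
    }

    // Shared arena: the same cross-thread access is allowed.
    Arena.ofShared().use { arena ->
        val segment = arena.allocate(16L)
        segment.set(ValueLayout.JAVA_BYTE, 0L, 42.toByte())
        thread {
            println("other thread read: ${segment.get(ValueLayout.JAVA_BYTE, 0L)}") // prints 42
        }.join()
    }
}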

In the long term, it would be worth testing whether the Arena.ofConfined() approach actually performs worse than the multi-threaded model. Using Arena.ofConfined() may reduce the need for explicit lock management and unnecessary buffer allocations, potentially lowering GC overhead.

cc @ankitkala

10000-ki avatar Apr 09 '25 07:04 10000-ki

This is happening to us on AWS-managed OpenSearch 2.19 clusters with quite small indices (4.2gb of primaries, 10,423,694 docs). @ankitkala, what's the lowest OpenSearch version where this is fixed?

luisfavila avatar Jun 26 '25 12:06 luisfavila

@luisfavila hi

The issue has been fixed in version 2.19.2 and later. Could you try using version 2.19.2 or higher?

10000-ki avatar Jun 27 '25 08:06 10000-ki

@10000-ki Thank you! I'm trying to get AWS support to upgrade us to that, but they don't let us select patch versions (only 2.17, 2.18, 2.19). The latest we can pick is 2.19, which we're already on and where the bug is still present, so I'm not sure what to do.

Edit: AWS said they won't help unless we subscribe to Business support, ahah. Stay away from managed solutions, people.

Shouldn't the PR be listed in the release notes? https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.19.2.md

luisfavila avatar Jun 27 '25 09:06 luisfavila

@luisfavila https://github.com/opensearch-project/cross-cluster-replication/commits/2.19/

See the commit history on the 2.19 branch linked above.

If you can't choose the patch version, you may have to use OpenSearch v3.0.0.

10000-ki avatar Jun 30 '25 14:06 10000-ki