
[BUG] org.opensearch.cluster.coordination.CoordinationStateRejectedException upgrading 2.0.1 to 3.0

Open cliu123 opened this issue 2 years ago • 22 comments

Describe the bug Nodes cannot join back to the cluster after upgrading from 2.0.1 to 3.0.0.

To Reproduce Failing GHA: https://github.com/cliu123/security/runs/6926965970?check_suite_focus=true

Expected behavior Nodes rejoin the cluster successfully after upgrading from 2.0.1 to 3.0.0.

Plugins OpenSearch Security.

Error logs

[WARN ][o.o.c.NodeConnectionsService] [securityBwcCluster0-2] failed to connect to {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} (tried [1] times)
»  org.opensearch.transport.ConnectTransportException: [securityBwcCluster0-1][127.0.0.1:43131] connect_exception
»  	at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1076) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:215) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:55) ~[opensearch-core-2.0.1.jar:2.0.1]
»  	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
»  	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
»  	at org.opensearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:70) ~[opensearch-core-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:81) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
»  	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
»  	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
»  	at java.lang.Thread.run(Thread.java:833) [?:?]
»  Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 127.0.0.1/127.0.0.1:43131
»  Caused by: java.net.ConnectException: Connection refused
»  	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
»  	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
»  	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
»  	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
»  	... 7 more
»   ↓ last 40 non error or warning messages from /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster0-2/logs/opensearch.stdout.log ↓
» [2022-06-16T22:10:14,761][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] node name [securityBwcCluster0-2], node ID [jjX5m3WkSLe8eAwhuQhn7A], cluster name [securityBwcCluster0], roles [cluster_manager, remote_cluster_client, data, ingest]
» [2022-06-16T22:10:19,058][INFO ][o.o.t.NettyAllocator     ] [securityBwcCluster0-2] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={opensearch.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=512mb}]
» [2022-06-16T22:10:19,129][INFO ][o.o.d.DiscoveryModule    ] [securityBwcCluster0-2] using discovery type [zen] and seed hosts providers [settings, file]
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] initialized
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] starting ...
» [2022-06-16T22:10:19,727][INFO ][o.o.t.TransportService   ] [securityBwcCluster0-2] publish_address {127.0.0.1:42529}, bound_addresses {[::1]:37963}, {127.0.0.1:42529}
» [2022-06-16T22:10:19,932][DEBUG][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] startInitialJoin: coordinator becoming CANDIDATE in term 0 (was null, lastKnownLeader was [Optional.empty])
» [2022-06-16T22:10:19,960][INFO ][o.o.h.AbstractHttpServerTransport] [securityBwcCluster0-2] publish_address {127.0.0.1:46681}, bound_addresses {[::1]:41935}, {127.0.0.1:46681}
» [2022-06-16T22:10:19,962][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] started
» [2022-06-16T22:10:19,962][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] Node started
» [2022-06-16T22:10:19,968][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] 0 OpenSearch Security modules loaded so far: []
» [2022-06-16T22:10:20,876][INFO ][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] setting initial configuration to VotingConfiguration{jjX5m3WkSLe8eAwhuQhn7A,Us2JKk3aSU-lIJVCnUQ02w,{bootstrap-placeholder}-securityBwcCluster0-0}
» [2022-06-16T22:10:21,093][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,103][DEBUG][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] joinLeaderInTerm: for [{securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}] with term 1
» [2022-06-16T22:10:21,104][DEBUG][o.o.c.c.CoordinationState] [securityBwcCluster0-2] handleStartJoin: leaving term [0] due to StartJoinRequest{term=1,node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}
» [2022-06-16T22:10:21,129][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}} starting election
» [2022-06-16T22:10:21,131][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=645, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,134][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} requesting pre-votes from [{securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]
» [2022-06-16T22:10:21,157][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,159][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}, {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,150][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}]}, node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}} failed
»  org.opensearch.transport.RemoteTransportException: [securityBwcCluster0-1][127.0.0.1:43131][internal:cluster/request_pre_vote]
»  Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader
»  	at org.opensearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:162) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:100) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:103) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:798) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.0.1.jar:2.0.1]
»  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
»  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
»  	at java.lang.Thread.run(Thread.java:833) [?:?]

cliu123 avatar Jun 16 '22 22:06 cliu123
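The `CoordinationStateRejectedException` in the trace above follows from the coordination layer's pre-vote rule: a node that still believes a leader exists refuses to grant pre-votes, so a candidate surrounded by such nodes can never reach quorum. The sketch below is a simplified, hypothetical model of that rule (all class and function names are illustrative; the real logic lives in `org.opensearch.cluster.coordination.PreVoteCollector`):

```python
# Toy model of the pre-vote rejection seen in the log above.
# Names are hypothetical; this is not the actual OpenSearch implementation.

class Node:
    def __init__(self, name, term, leader=None):
        self.name = name
        self.term = term
        self.leader = leader  # name of the leader this node currently follows, if any

    def handle_pre_vote_request(self, source, request_term):
        # Mirrors "rejecting PreVoteRequest ... as there is already a leader":
        # a node that still knows a leader will not grant a pre-vote.
        if self.leader is not None:
            raise RuntimeError(
                f"rejecting PreVoteRequest from {source} as there is already a leader"
            )
        return {"currentTerm": self.term}


def collect_pre_votes(candidate, peers, quorum):
    """Candidate wins the pre-vote round only if a quorum of peers responds."""
    votes = 0
    for peer in peers:
        try:
            peer.handle_pre_vote_request(candidate.name, candidate.term)
            votes += 1
        except RuntimeError:
            pass  # rejected, e.g. the peer still follows a leader
    return votes >= quorum


# Node-2 (the restarted node) asks its peers for pre-votes. Node-1 still
# believes a leader exists and rejects, so a quorum of 2 peer votes fails.
node0 = Node("securityBwcCluster0-0", term=1)
node1 = Node("securityBwcCluster0-1", term=1, leader="securityBwcCluster0-1")
candidate = Node("securityBwcCluster0-2", term=1)
print(collect_pre_votes(candidate, [node0, node1], quorum=2))  # False
```

In this toy setup the candidate does not count its own vote; the point is only that a single "already a leader" rejection can starve the election, which matches the repeated "no quorum yet" messages in the log.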

Is this consistently reproducible? Did you try to debug it? The error looks suspicious:

rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader

Could this be related to the master/cluster-manager/leader renaming?

dblock avatar Jun 17 '22 18:06 dblock

@cliu123 did you try running opensearch-min without security plugin? OpenSearch 3.0.0 is passing bwc tests with 2.0.1. Ref: https://github.com/opensearch-project/OpenSearch/blob/main/.ci/bwcVersions

saratvemulapalli avatar Jun 21 '22 20:06 saratvemulapalli

@cliu123 did you try running opensearch-min without security plugin? OpenSearch 3.0.0 is passing bwc tests with 2.0.1. Ref: https://github.com/opensearch-project/OpenSearch/blob/main/.ci/bwcVersions

That's a good point! I haven't tried that as I don't see any security specific errors. But I'll try.

cliu123 avatar Jun 21 '22 20:06 cliu123

@saratvemulapalli Are there any passing BWC test runs from 2.0.1 to 3.0.0? I don't see any BWC runs in GitHub actions in this repo. Would you please share a pointer to the passing test run? I'd like to use it to investigate the failures in the security repo.

cliu123 avatar Jun 21 '22 21:06 cliu123

Sure, here is our testing process for bwc: https://github.com/opensearch-project/OpenSearch/blob/main/TESTING.md#testing-backwards-compatibility

We dont use Github workflows for testing, instead we use jenkins infra. Here is one successful run on a PR: https://github.com/opensearch-project/OpenSearch/pull/3618#issuecomment-1158357646 Logs: https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log

saratvemulapalli avatar Jun 21 '22 21:06 saratvemulapalli

@saratvemulapalli Thanks! But I don't see any tests upgrading to 3.0.0 in the reports/logs. The error included in the issue description shows that while the upgraded node tries to rejoin the cluster, issues happen during cluster-manager election (Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest), which seems to be something wrong in the core engine. Besides, I see the same error without the security plugin installed.

cliu123 avatar Jun 21 '22 22:06 cliu123

@cliu123 ./gradlew bwcTest should run all the tests. I ran it locally, and a 3.0.0 node was able to join a 2.0.1 cluster. How are your tests set up? Also, looking at your change, there are some changes related to inclusive naming.

Sure let me take a stab at this.

saratvemulapalli avatar Jun 22 '22 18:06 saratvemulapalli

I see the same error

The setup has problems with settings; very likely it didn't disable security.

»  java.lang.IllegalArgumentException: unknown setting [plugins.security.disabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:591)
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:532)
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:502)
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:472)
»  	at org.opensearch.common.settings.SettingsModule.<init>(SettingsModule.java:170)
»  	at org.opensearch.node.Node.<init>(Node.java:479)
»  	at org.opensearch.node.Node.<init>(Node.java:339)
»  	at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
»  	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
»  	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
»  	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
»  	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
»  	at org.opensearch.cli.Command.main(Command.java:101)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)
»  For complete error details, refer to the log at /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster1-0/logs/securityBwcCluster1.log

saratvemulapalli avatar Jun 22 '22 21:06 saratvemulapalli
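The `unknown setting [plugins.security.disabled]` failure above comes from OpenSearch's strict settings validation: a node only accepts settings that core or some installed plugin has registered. A hedged toy model of that behavior (setting and plugin names here are illustrative, not the actual registry):

```python
# Toy model of the strict settings validation behind
# "unknown setting [plugins.security.disabled]": a setting is only valid
# if core or an installed plugin has registered it.

CORE_SETTINGS = {"cluster.name", "node.name"}  # abridged, illustrative

def registered_settings(installed_plugins):
    settings = set(CORE_SETTINGS)
    if "opensearch-security" in installed_plugins:
        # The security plugin registers its own settings on install.
        settings.add("plugins.security.disabled")
    return settings

def validate(node_settings, registered):
    for key in node_settings:
        if key not in registered:
            raise ValueError(
                f"unknown setting [{key}] please check that any required "
                "plugins are installed"
            )

# Passing plugins.security.disabled without the plugin installed fails,
# which is what the BWC test cluster hit here:
try:
    validate({"plugins.security.disabled": "true"},
             registered_settings(installed_plugins=[]))
except ValueError as e:
    print(e)
```

This is why the test setup must either install the security plugin on every node or drop the setting entirely when running without it.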

@saratvemulapalli Would you please check or provide the logs upgrade from 2.0.1 to 3.0.0 without security plugin? I don't see the related logs in https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log. Did I miss anything?

cliu123 avatar Jun 23 '22 21:06 cliu123

@Rishikesh1159 could you take a look at this.

saratvemulapalli avatar Jun 29 '22 21:06 saratvemulapalli

@Rishikesh1159 could you take a look at this.

Sure

Rishikesh1159 avatar Jun 29 '22 21:06 Rishikesh1159

@saratvemulapalli Would you please check or provide the logs upgrade from 2.0.1 to 3.0.0 without security plugin? I don't see the related logs in https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log. Did I miss anything?

@cliu123 You can see the backward compatibility tests being run as part of that gradle check:

$ grep 'bwcTest' ~/Downloads/gradle_check_6087.log
> Task :qa:verify-version-constants:v2.1.0#bwcTest
> Task :qa:verify-version-constants:v2.0.2#bwcTest
> Task :qa:verify-version-constants:bwcTestSnapshots
> Task :qa:repository-multi-version:v2.0.2#bwcTest
> Task :qa:repository-multi-version:v2.1.0#bwcTest
> Task :qa:repository-multi-version:bwcTestSnapshots
> Task :qa:rolling-upgrade:v2.1.0#bwcTest
> Task :qa:rolling-upgrade:bwcTestSnapshots
> Task :qa:full-cluster-restart:v2.0.2#bwcTest
> Task :qa:full-cluster-restart:v2.1.0#bwcTest
> Task :qa:full-cluster-restart:bwcTestSnapshots
> Task :qa:mixed-cluster:v2.1.0#bwcTest
> Task :qa:mixed-cluster:bwcTestSnapshots

Are there specific logs that you're looking for? I don't think the test output logs a whole lot when the tasks succeed.

andrross avatar Jul 08 '22 16:07 andrross

Can someone please help with this issue? It's blocking the security plugin version bump to 3.0. We want to build cross-cluster-replication for 3.0, and since we have a dependency on the security plugin, we're blocked as well.

ankitkala avatar Jul 20 '22 16:07 ankitkala

The BWC test failure still persists: https://github.com/cliu123/security/runs/7434973804?check_suite_focus=true. Looks like the node failed to rejoin the cluster after upgrading. @CEHENKLE Could anyone take a look?

cliu123 avatar Jul 20 '22 17:07 cliu123

@amitgalitz also got BWC test failures when upgrading to 3.0.0 with the job-scheduler plugin installed but without the security plugin installed: https://github.com/opensearch-project/job-scheduler/runs/7415033030?check_suite_focus=true.

cliu123 avatar Jul 21 '22 20:07 cliu123

The k-NN plugin also has the same issue with BWC tests when trying to upgrade from 2.1.0 to 3.0.0-SNAPSHOT. Interestingly, the restart-upgrade BWC tests pass while the rolling-upgrade BWC tests fail. The link to the GitHub Action: https://github.com/naveentatikonda/k-NN/runs/7508740179?check_suite_focus=true

naveentatikonda avatar Jul 25 '22 21:07 naveentatikonda

@amitgalitz also got BWC test failures when upgrading to 3.0.0 with job-scheduler plugin installed but without security plugin installed: https://github.com/opensearch-project/job-scheduler/runs/7415033030?check_suite_focus=true.

@cliu123 @amitgalitz In the above logs it says that you are trying to upgrade from 7.10.2 to 3.0.0. I think we cannot upgrade directly from 7.x to 3.0.0. Could you please try to upgrade from 2.x to 3.0.0-SNAPSHOT?

java.lang.IllegalStateException: cannot upgrade a node from version [7.10.2] directly to version [3.0.0]
»  	at org.opensearch.env.NodeMetadata.upgradeToCurrentVersion(NodeMetadata.java:101)
»  	at org.opensearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:476)
»  	at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:369)
»  	at org.opensearch.node.Node.<init>(Node.java:433)
»  	at org.opensearch.node.Node.<init>(Node.java:342)
»  	at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
»  	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
»  	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
»  	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
»  	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
»  	at org.opensearch.cli.Command.main(Command.java:101)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)

naveentatikonda avatar Jul 25 '22 22:07 naveentatikonda
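The `cannot upgrade a node from version [7.10.2] directly to version [3.0.0]` error reflects the rule that a node can only move forward one major line at a time, with OpenSearch treating legacy 7.10.x as equivalent to its own 1.x line. The helper below is a rough, hypothetical sketch of that check (not the actual `NodeMetadata.upgradeToCurrentVersion` code; the 7→1 mapping is an assumption stated here for illustration):

```python
# Hypothetical sketch of the compatibility rule behind
# "cannot upgrade a node from version [7.10.2] directly to version [3.0.0]".
# Assumption: legacy 7.10.x is treated as the OpenSearch 1.x line.

def effective_major(version):
    major = int(version.split(".")[0])
    return 1 if major == 7 else major  # legacy ES 7.10.x -> OpenSearch 1.x

def can_upgrade_directly(from_version, to_version):
    """A node may stay on the same major or move to the next one,
    but never skip a major version in a single upgrade."""
    jump = effective_major(to_version) - effective_major(from_version)
    return 0 <= jump <= 1

print(can_upgrade_directly("7.10.2", "3.0.0"))  # False: 1.x -> 3.x skips 2.x
print(can_upgrade_directly("2.0.1", "3.0.0"))   # True
```

This is why the job-scheduler full-restart task needs its starting cluster bumped from 7.10.2 to a 2.x version before a 3.0.0 upgrade can work.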

Hi @naveentatikonda, yes, for job-scheduler it's the same issue. I tried upgrading from 2.x to 3.0.0 and faced the issue I linked. Today I'll test again to see if the issue is fixed.

sbcd90 avatar Jul 28 '22 17:07 sbcd90

I seem to have run into the same issue with Notifications as well when upgrading to 3.0.0. The plugin passes all other tests and can come up on 3.0, but the BWC test fails with the same error.

qreshi avatar Jul 28 '22 18:07 qreshi

@VachaShah (our oncall for this week) could you take a stab at this one? Looks like multiple plugins are impacted for 3.0.

saratvemulapalli avatar Jul 28 '22 21:07 saratvemulapalli

I took a quick look at all of the failing runs, and I can't see a common root cause or pattern to them. Summarizing my observations below:


I re-ran the BWC tests for job-scheduler, and the task :opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask fails because one of the nodes fails to start:

java.lang.IllegalStateException: cannot upgrade a node from version [7.10.2] directly to version [3.0.0]

As noted by @naveentatikonda please try upgrading from 2.x to 3.0.0-SNAPSHOT instead.


Looking at the BWC test run for the k-NN plugin, the :qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster task failed because the cluster didn't pass health checks after 40 seconds. However, the cluster does turn green at one point:

[2022-07-25T21:07:47,736][INFO ][o.o.c.r.a.AllocationService] [knnBwcCluster-rolling-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[knn-bwc-testknnwarmup][0]]])

but subsequently the 3.0.0-SNAPSHOT node is (for some reason) detected as disconnected, marked as faulty and removed:

[2022-07-25T21:07:48,446][INFO ][o.o.c.c.FollowersChecker ] [knnBwcCluster-rolling-1] FollowerChecker{discoveryNode={knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} disconnected
» [2022-07-25T21:07:48,448][INFO ][o.o.c.c.FollowersChecker ] [knnBwcCluster-rolling-1] FollowerChecker{discoveryNode={knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
» [2022-07-25T21:07:48,469][INFO ][o.o.c.s.MasterService    ] [knnBwcCluster-rolling-1] node-left[{knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} reason: disconnected], term: 2, version: 41, delta: removed {{knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}
» [2022-07-25T21:07:48,470][DEBUG][o.o.c.c.PublicationTransportHandler] [knnBwcCluster-rolling-1] received diff cluster state version [41] with uuid [A5Mb175ASyqEcDfOOAWj4w], diff size [600]
» [2022-07-25T21:07:48,511][INFO ][o.o.c.s.ClusterApplierService] [knnBwcCluster-rolling-1] removed {{knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}, term: 2, version: 41, reason: Publication{term=2, version=41}

Meanwhile, the 3.0.0-SNAPSHOT node is initially unable to discover a cluster manager. It ultimately starts up, but the timestamp seems too late:

[2022-07-25T21:07:58,382][INFO ][o.o.c.c.Coordinator      ] [knnBwcCluster-rolling-0] cluster UUID [G8n67oeoTiWBJaYnbcY6-w]
» [2022-07-25T21:07:58,385][DEBUG][o.o.c.c.Coordinator      ] [knnBwcCluster-rolling-0] startInitialJoin: coordinator becoming CANDIDATE in term 2 (was null, lastKnownLeader was [Optional.empty])
» [2022-07-25T21:07:58,406][INFO ][o.o.h.AbstractHttpServerTransport] [knnBwcCluster-rolling-0] publish_address {127.0.0.1:40253}, bound_addresses {[::1]:43789}, {127.0.0.1:40253}
» [2022-07-25T21:07:58,407][INFO ][o.o.n.Node               ] [knnBwcCluster-rolling-0] started

The security plugin shows a similar chain of events to k-NN though the cluster never actually turns green in this case.

kartg avatar Aug 09 '22 04:08 kartg

@kartg Thanks for looking into this and sharing your observations! I included the log that I had in the issue description. It was upgrading from 2.0.1 to 3.0.0, but failed.

cliu123 avatar Aug 09 '22 17:08 cliu123

Hi @kartg, the current PR for job-scheduler doesn't have the 2.0.1 fix. The actual error is the one mentioned by @cliu123 above.

The mixedClusterTask in the BWC tests fails. Here are some of the exception logs:

org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
»  	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:283) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
»  	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
[jobSchedulerBwcCluster0-0] cluster-manager not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover cluster-manager-eligible nodes [jobSchedulerBwcCluster0-0, jobSchedulerBwcCluster0-1, jobSchedulerBwcCluster0-2] to bootstrap a cluster: have discovered [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{YmkyQHPdS1KWJmMpVKxTiA}{127.0.0.1}{127.0.0.1:33145}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]; discovery will continue using [] from hosts providers and [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{YmkyQHPdS1KWJmMpVKxTiA}{127.0.0.1}{127.0.0.1:33145}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
» WARN ][o.o.c.c.ClusterFormationFailureHelper] [jobSchedulerBwcCluster0-0] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [K4KL68aXRbWpqXO3eNH_Cg, E0bS9c3KTRumdXQmDzZdLw, 9SouJ8iLTEyiTyEwwoHp8w], have discovered [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{uBo3iH55TOyBFimNH6tZ7g}{127.0.0.1}{127.0.0.1:36763}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}] which is not a quorum; discovery will continue using [[::1]:36639, 127.0.0.1:35113, [::1]:46143, 127.0.0.1:41555] from hosts providers and [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{uBo3iH55TOyBFimNH6tZ7g}{127.0.0.1}{127.0.0.1:36763}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 3, last-accepted version 4 in term 3
»   ↑ repeated 6 times ↑

sbcd90 avatar Aug 11 '22 18:08 sbcd90
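The `ClusterManagerNotDiscoveredException` above is a quorum problem: an election requires a majority of the node IDs in the last-accepted voting configuration, and the restarted node has only discovered itself. A toy version of that check, using the IDs from the warning above (function names are illustrative, not the actual coordination code):

```python
# Toy model of the quorum check behind the ClusterFormationFailureHelper
# warning: "an election requires at least 2 nodes with ids from [...]".

def election_quorum(voting_config_ids):
    # A majority of the voting configuration.
    return len(voting_config_ids) // 2 + 1

def can_elect(voting_config_ids, discovered_ids):
    return len(voting_config_ids & discovered_ids) >= election_quorum(voting_config_ids)

voting = {"K4KL68aXRbWpqXO3eNH_Cg", "E0bS9c3KTRumdXQmDzZdLw", "9SouJ8iLTEyiTyEwwoHp8w"}

# The upgraded node has only discovered itself: 1 of 3 is not a quorum of 2.
print(can_elect(voting, {"9SouJ8iLTEyiTyEwwoHp8w"}))  # False
# Once the other voting nodes are discovered, an election can proceed.
print(can_elect(voting, voting))  # True
```

This matches the log: with a three-node voting configuration, the lone discovered node can never reach the required two votes, so discovery loops and the warning repeats.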

@cliu123 The BWC tests for OpenSearch itself seem to be having no issues. From the logs, it's not clear what is going on: k-NN fails on the rolling upgrade tests (:qa:rolling-upgrade:knnBwcCluster-rolling) while the security plugin fails on the mixed cluster tests (:securityBwcCluster#mixedClusterTask) with different errors. The underlying cause may be the same, but it's not obvious.

I tried to reproduce the issue locally from the branch https://github.com/cliu123/security/tree/bump_version_to_3.0.0.0, but the build fails. Can you rebase the branch so we can build and reproduce the errors locally?

adnapibar avatar Aug 29 '22 17:08 adnapibar

This is also blocking the work of AD Extension and Job Scheduler.

owaiskazi19 avatar Aug 30 '22 18:08 owaiskazi19

This is also blocking the work of AD Extension and Job Scheduler.

Do we have a branch for AD and/or Job Scheduler that we can use to reproduce the issue locally?

adnapibar avatar Aug 30 '22 18:08 adnapibar

Do we have a branch for AD and/or Job Scheduler that we can use to reproduce the issue locally?

We can just update the main branch of Job Scheduler to 3.0.0-SNAPSHOT and reproduce this issue.

owaiskazi19 avatar Aug 30 '22 18:08 owaiskazi19

For job-scheduler I'm getting a dependency version conflict error with OpenSearch 3.0.0-SNAPSHOT:

Execution failed for task ':compileTestJava'.
> Could not resolve all dependencies for configuration ':testCompileClasspath'.
   > Conflict(s) found for the following module(s):
       - commons-codec:commons-codec between versions 1.15 and 1.13
     Run with:
         --scan or
         :dependencyInsight --configuration testCompileClasspath --dependency commons-codec:commons-codec
     to get more insight on how to solve the conflict.

Related issue - https://github.com/opensearch-project/OpenSearch/issues/4336

adnapibar avatar Aug 30 '22 20:08 adnapibar

That's weird. For job-scheduler I'm getting a BWC test failure on main after updating OpenSearch to 3.0.0-SNAPSHOT:

* What went wrong:
Execution failed for task ':opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask'.
> process was found dead while waiting for ports files, node{:opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster1-0}

* Try:
> Run with --scan to get full insights.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask'.
        at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.lambda$executeIfValid$1(ExecuteActionsTaskExecuter.java:147)
        at org.gradle.internal.Try$Failure.ifSuccessfulOrElse(Try.java:282)
        at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeIfValid(ExecuteActionsTaskExecuter.java:145)
        at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:133)
        at org.gradle.api.internal.tasks.execution.CleanupStaleOutputsExecuter.execute(CleanupStaleOutputsExecuter.java:77)
        at org.gradle.api.internal.tasks.execution.FinalizePropertiesTaskExecuter.execute(FinalizePropertiesTaskExecuter.java:46)

owaiskazi19 avatar Aug 30 '22 21:08 owaiskazi19

That's weird. For job-scheduler I'm getting BWC test failure on main after updating OpenSearch to 3.0.0-SNAPSHOT

It was an issue with my local Gradle cache; I'm now able to reproduce it.

adnapibar avatar Aug 30 '22 23:08 adnapibar