OpenSearch
[BUG] org.opensearch.cluster.coordination.CoordinationStateRejectedException upgrading 2.0.1 to 3.0
Describe the bug: Nodes cannot join back to the cluster after upgrading from 2.0.1 to 3.0.0.
To Reproduce: Failing GHA run: https://github.com/cliu123/security/runs/6926965970?check_suite_focus=true
Expected behavior: Nodes rejoin the cluster successfully after the upgrade.
Plugins: OpenSearch Security.
Error logs:
WARN ][o.o.c.NodeConnectionsService] [securityBwcCluster0-2] failed to connect to {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} (tried [1] times)
» org.opensearch.transport.ConnectTransportException: [securityBwcCluster0-1][127.0.0.1:43131] connect_exception
» at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1076) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:215) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:55) ~[opensearch-core-2.0.1.jar:2.0.1]
» at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
» at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
» at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
» at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
» at org.opensearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:70) ~[opensearch-core-2.0.1.jar:2.0.1]
» at org.opensearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:81) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]
» at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
» at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
» at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
» at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[?:?]
» at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) ~[?:?]
» at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) ~[?:?]
» at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
» at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
» at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
» at java.lang.Thread.run(Thread.java:833) [?:?]
» Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 127.0.0.1/127.0.0.1:43131
» Caused by: java.net.ConnectException: Connection refused
» at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
» at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
» at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
» at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
» at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
» ... 7 more
» ↓ last 40 non error or warning messages from /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster0-2/logs/opensearch.stdout.log ↓
» [2022-06-16T22:10:14,761][INFO ][o.o.n.Node ] [securityBwcCluster0-2] node name [securityBwcCluster0-2], node ID [jjX5m3WkSLe8eAwhuQhn7A], cluster name [securityBwcCluster0], roles [cluster_manager, remote_cluster_client, data, ingest]
» [2022-06-16T22:10:19,058][INFO ][o.o.t.NettyAllocator ] [securityBwcCluster0-2] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={opensearch.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=512mb}]
» [2022-06-16T22:10:19,129][INFO ][o.o.d.DiscoveryModule ] [securityBwcCluster0-2] using discovery type [zen] and seed hosts providers [settings, file]
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node ] [securityBwcCluster0-2] initialized
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node ] [securityBwcCluster0-2] starting ...
» [2022-06-16T22:10:19,727][INFO ][o.o.t.TransportService ] [securityBwcCluster0-2] publish_address {127.0.0.1:42529}, bound_addresses {[::1]:37963}, {127.0.0.1:42529}
» [2022-06-16T22:10:19,932][DEBUG][o.o.c.c.Coordinator ] [securityBwcCluster0-2] startInitialJoin: coordinator becoming CANDIDATE in term 0 (was null, lastKnownLeader was [Optional.empty])
» [2022-06-16T22:10:19,960][INFO ][o.o.h.AbstractHttpServerTransport] [securityBwcCluster0-2] publish_address {127.0.0.1:46681}, bound_addresses {[::1]:41935}, {127.0.0.1:46681}
» [2022-06-16T22:10:19,962][INFO ][o.o.n.Node ] [securityBwcCluster0-2] started
» [2022-06-16T22:10:19,962][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] Node started
» [2022-06-16T22:10:19,968][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] 0 OpenSearch Security modules loaded so far: []
» [2022-06-16T22:10:20,876][INFO ][o.o.c.c.Coordinator ] [securityBwcCluster0-2] setting initial configuration to VotingConfiguration{jjX5m3WkSLe8eAwhuQhn7A,Us2JKk3aSU-lIJVCnUQ02w,{bootstrap-placeholder}-securityBwcCluster0-0}
» [2022-06-16T22:10:21,093][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,103][DEBUG][o.o.c.c.Coordinator ] [securityBwcCluster0-2] joinLeaderInTerm: for [{securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}] with term 1
» [2022-06-16T22:10:21,104][DEBUG][o.o.c.c.CoordinationState] [securityBwcCluster0-2] handleStartJoin: leaving term [0] due to StartJoinRequest{term=1,node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}
» [2022-06-16T22:10:21,129][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}} starting election
» [2022-06-16T22:10:21,131][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=645, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,134][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} requesting pre-votes from [{securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]
» [2022-06-16T22:10:21,157][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,159][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}, {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,150][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}]}, node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}} failed
» org.opensearch.transport.RemoteTransportException: [securityBwcCluster0-1][127.0.0.1:43131][internal:cluster/request_pre_vote]
» Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader
» at org.opensearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:162) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:100) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:103) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:798) ~[opensearch-2.0.1.jar:2.0.1]
» at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.0.1.jar:2.0.1]
» at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
» at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
» at java.lang.Thread.run(Thread.java:833) [?:?]
Is this consistently reproducible? Did you try to debug it? The error looks suspicious:
rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader
Could this be related to the master/cluster-manager/leader renaming?
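For context on that rejection: during the pre-vote phase, a node that already follows an elected leader turns down incoming pre-vote requests, which is exactly the "rejecting PreVoteRequest ... as there is already a leader" message. The sketch below is purely illustrative; the function and field names are hypothetical, not the actual OpenSearch PreVoteCollector code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreVoteResponse:
    current_term: int
    granted: bool
    reason: Optional[str] = None

def handle_pre_vote(current_term: int, known_leader: Optional[str],
                    request_term: int) -> PreVoteResponse:
    # A node that already knows of an elected leader rejects pre-votes
    # outright, regardless of the requester's term.
    if known_leader is not None:
        return PreVoteResponse(current_term, False,
                               f"there is already a leader: {known_leader}")
    # A request from an older term is also rejected.
    if request_term < current_term:
        return PreVoteResponse(current_term, False, "stale term")
    return PreVoteResponse(current_term, True)
```

In the failing run, securityBwcCluster0-1 evidently still tracked a leader, so the upgraded node's pre-vote was rejected and it could not win an election.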
@cliu123 did you try running opensearch-min without security plugin? OpenSearch 3.0.0 is passing bwc tests with 2.0.1. Ref: https://github.com/opensearch-project/OpenSearch/blob/main/.ci/bwcVersions
That's a good point! I haven't tried that, since I don't see any security-specific errors, but I'll try.
@saratvemulapalli Are there any passing BWC test runs from 2.0.1 to 3.0.0? I don't see any BWC runs in GitHub actions in this repo. Would you please share a pointer to the passing test run? I'd like to use it to investigate the failures in the security repo.
Sure, here is our testing process for bwc: https://github.com/opensearch-project/OpenSearch/blob/main/TESTING.md#testing-backwards-compatibility
We don't use GitHub workflows for testing; instead we use Jenkins infra. Here is one successful run on a PR: https://github.com/opensearch-project/OpenSearch/pull/3618#issuecomment-1158357646. Logs: https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log
@saratvemulapalli Thanks! But I don't see any tests upgrading to 3.0.0 in the reports/logs. The error included in the issue description shows that while the upgraded node tries to join back to the cluster, things go wrong during master election (Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest), which seems to point at the core engine.
Besides, I see the same error without the security plugin installed.
@cliu123 ./gradlew bwcTest should run all the tests. I ran it locally, and a 3.0.0 node was able to join a 2.0.1 cluster.
How are your tests set up? Also, looking at your change, there are some changes related to inclusive naming.
Sure, let me take a stab at this.
I see the same error
The setup has problems with settings; very likely it didn't disable security:
» java.lang.IllegalArgumentException: unknown setting [plugins.security.disabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
» at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:591)
» at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:532)
» at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:502)
» at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:472)
» at org.opensearch.common.settings.SettingsModule.<init>(SettingsModule.java:170)
» at org.opensearch.node.Node.<init>(Node.java:479)
» at org.opensearch.node.Node.<init>(Node.java:339)
» at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
» at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
» at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
» at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
» at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
» at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
» at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
» at org.opensearch.cli.Command.main(Command.java:101)
» at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
» at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)
» For complete error details, refer to the log at /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster1-0/logs/securityBwcCluster1.log
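That unknown-setting failure is the expected behavior when a node is started with plugins.security.disabled but the security plugin is not installed: node settings are validated at startup against the set of settings registered by core plus installed plugins, and anything else is rejected. A rough illustration of that rule (the names here are hypothetical, not the real AbstractScopedSettings API):

```python
# Settings registered by core (tiny illustrative subset).
CORE_SETTINGS = {"cluster.name", "node.name", "discovery.type"}

def validate_settings(settings: dict, plugin_settings: set) -> None:
    # Reject any key that neither core nor an installed plugin registered.
    known = CORE_SETTINGS | plugin_settings
    for key in settings:
        if key not in known:
            raise ValueError(f"unknown setting [{key}] please check that "
                             "any required plugins are installed")
```

So the test cluster setup here either needs the security plugin installed (which registers the setting) or must not pass plugins.security.disabled at all.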
@saratvemulapalli Would you please check or provide the logs for an upgrade from 2.0.1 to 3.0.0 without the security plugin? I don't see the related logs in https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log. Did I miss anything?
@Rishikesh1159 could you take a look at this.
Sure
@cliu123 You can see the backward compatibility tests being run as a part of that gradle check:
$ grep 'bwcTest' ~/Downloads/gradle_check_6087.log
> Task :qa:verify-version-constants:v2.1.0#bwcTest
> Task :qa:verify-version-constants:v2.0.2#bwcTest
> Task :qa:verify-version-constants:bwcTestSnapshots
> Task :qa:repository-multi-version:v2.0.2#bwcTest
> Task :qa:repository-multi-version:v2.1.0#bwcTest
> Task :qa:repository-multi-version:bwcTestSnapshots
> Task :qa:rolling-upgrade:v2.1.0#bwcTest
> Task :qa:rolling-upgrade:bwcTestSnapshots
> Task :qa:full-cluster-restart:v2.0.2#bwcTest
> Task :qa:full-cluster-restart:v2.1.0#bwcTest
> Task :qa:full-cluster-restart:bwcTestSnapshots
> Task :qa:mixed-cluster:v2.1.0#bwcTest
> Task :qa:mixed-cluster:bwcTestSnapshots
Are there specific logs that you're looking for? I don't think the test output logs a whole lot when the tasks succeed.
Can someone please help with this issue? It's blocking the security plugin version bump to 3.0. We want to build cross-cluster-replication for 3.0, and since we have a dependency on the security plugin, we're blocked as well.
The BWC test failure still persists: https://github.com/cliu123/security/runs/7434973804?check_suite_focus=true. It looks like the node failed to join back to the cluster after upgrading. @CEHENKLE Could anyone take a look?
@amitgalitz also got BWC test failures when upgrading to 3.0.0 with job-scheduler plugin installed but without security plugin installed: https://github.com/opensearch-project/job-scheduler/runs/7415033030?check_suite_focus=true.
The k-NN plugin also has the same issue with BWC tests when trying to upgrade from 2.1.0 to 3.0.0-SNAPSHOT. Interestingly, the restart-upgrade BWC tests pass while the rolling-upgrade BWC tests fail. Link to the GitHub Action: https://github.com/naveentatikonda/k-NN/runs/7508740179?check_suite_focus=true
@cliu123 @amitgalitz The above logs say that you are trying to upgrade from 7.10.2 to 3.0.0. I don't think we can upgrade directly from 7.x to 3.0.0. Could you please try to upgrade from 2.x to 3.0.0-SNAPSHOT?
java.lang.IllegalStateException: cannot upgrade a node from version [7.10.2] directly to version [3.0.0]
» at org.opensearch.env.NodeMetadata.upgradeToCurrentVersion(NodeMetadata.java:101)
» at org.opensearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:476)
» at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:369)
» at org.opensearch.node.Node.<init>(Node.java:433)
» at org.opensearch.node.Node.<init>(Node.java:342)
» at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
» at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
» at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
» at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
» at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
» at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
» at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
» at org.opensearch.cli.Command.main(Command.java:101)
» at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
» at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)
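This check is working as intended: a node's on-disk metadata can only be upgraded directly from a version at most one major back, and legacy Elasticsearch 6.x/7.x versions sort alongside OpenSearch 1.x for this purpose. A simplified sketch of the rule (illustrative only, not the real NodeMetadata/Version logic):

```python
def opensearch_equivalent_major(major: int) -> int:
    # Legacy Elasticsearch 6.x/7.x is treated roughly like OpenSearch 1.x.
    return 1 if major >= 6 else major

def can_upgrade_directly(from_version: tuple, to_version: tuple) -> bool:
    # A direct node upgrade may cross at most one major version.
    from_major = opensearch_equivalent_major(from_version[0])
    return 0 <= to_version[0] - from_major <= 1
```

Under this rule 2.1.0 to 3.0.0 is allowed, while 7.10.2 (equivalent to 1.x) to 3.0.0 would skip a major and is rejected, matching the IllegalStateException above.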
Hi @naveentatikonda, yes, for job-scheduler it's the same issue. I tried upgrading from 2.x to 3.0.0 and faced the issue I linked. Today I'll test again to see if the issue is fixed.
I seem to have run into the same issue with Notifications as well when upgrading to 3.0.0. The plugin passes all other tests and can come up with 3.0, but the BWC test fails with the same error.
@VachaShah (our oncall for this week) could you take a stab at this one? Looks like multiple plugins are impacted for 3.0.
I took a quick look at all of the failing runs, and I can't see a common root cause or pattern to them. Summarizing my observations below:
I re-ran the BWC tests for job-scheduler, and the task :opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask fails because one of the nodes fails to start with java.lang.IllegalStateException: cannot upgrade a node from version [7.10.2] directly to version [3.0.0]. As noted by @naveentatikonda, please try upgrading from 2.x to 3.0.0-SNAPSHOT instead.
Looking at the BWC test run for the k-NN plugin, the :qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster task failed because the cluster didn't pass health checks after 40 seconds. However, the cluster does turn green at one point:
[2022-07-25T21:07:47,736][INFO ][o.o.c.r.a.AllocationService] [knnBwcCluster-rolling-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[knn-bwc-testknnwarmup][0]]])
but subsequently the 3.0.0-SNAPSHOT node is (for some reason) detected as disconnected, marked as faulty and removed:
[2022-07-25T21:07:48,446][INFO ][o.o.c.c.FollowersChecker ] [knnBwcCluster-rolling-1] FollowerChecker{discoveryNode={knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} disconnected
» [2022-07-25T21:07:48,448][INFO ][o.o.c.c.FollowersChecker ] [knnBwcCluster-rolling-1] FollowerChecker{discoveryNode={knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
» [2022-07-25T21:07:48,469][INFO ][o.o.c.s.MasterService ] [knnBwcCluster-rolling-1] node-left[{knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} reason: disconnected], term: 2, version: 41, delta: removed {{knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}
» [2022-07-25T21:07:48,470][DEBUG][o.o.c.c.PublicationTransportHandler] [knnBwcCluster-rolling-1] received diff cluster state version [41] with uuid [A5Mb175ASyqEcDfOOAWj4w], diff size [600]
» [2022-07-25T21:07:48,511][INFO ][o.o.c.s.ClusterApplierService] [knnBwcCluster-rolling-1] removed {{knnBwcCluster-rolling-0}{nRAPlFMNTI2Yt4E4YW60RA}{WvizLTUJT7q6e49dEjcrIA}{127.0.0.1}{127.0.0.1:38491}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}, term: 2, version: 41, reason: Publication{term=2, version=41}
Meanwhile the 3.0.0-SNAPSHOT node is initially unable to resolve a cluster manager. It ultimately starts up, but the timestamp seems too late:
[2022-07-25T21:07:58,382][INFO ][o.o.c.c.Coordinator ] [knnBwcCluster-rolling-0] cluster UUID [G8n67oeoTiWBJaYnbcY6-w]
» [2022-07-25T21:07:58,385][DEBUG][o.o.c.c.Coordinator ] [knnBwcCluster-rolling-0] startInitialJoin: coordinator becoming CANDIDATE in term 2 (was null, lastKnownLeader was [Optional.empty])
» [2022-07-25T21:07:58,406][INFO ][o.o.h.AbstractHttpServerTransport] [knnBwcCluster-rolling-0] publish_address {127.0.0.1:40253}, bound_addresses {[::1]:43789}, {127.0.0.1:40253}
» [2022-07-25T21:07:58,407][INFO ][o.o.n.Node ] [knnBwcCluster-rolling-0] started
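For reference, the "didn't pass health checks after 40 seconds" failure is a timed-out wait-for-green loop: since the upgraded node only came back after the deadline, the task fails even though the cluster was briefly green earlier. Roughly (an illustrative sketch, not the actual test framework code):

```python
import time

def wait_for_green(get_health, timeout_s: float = 40.0, poll_s: float = 0.5) -> bool:
    # Poll cluster health until it reports GREEN or the deadline passes.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_health() == "GREEN":
            return True
        time.sleep(poll_s)
    return False
```

A transient green state before the node drops out doesn't help; what matters is whether a poll observes GREEN before the timeout.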
The security plugin shows a similar chain of events to k-NN though the cluster never actually turns green in this case.
@kartg Thanks for looking into this and sharing your observations! I included the log that I had in the issue description. It was upgrading from 2.0.1 to 3.0.0, but failed.
Hi @kartg, the current PR for job-scheduler doesn't have the 2.0.1 fix. The actual error is the one mentioned by @cliu123 above: the mixedClusterTask in the BWC tests fails. Here are some of the exception logs:
org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
» at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:283) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
» at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
» at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
» at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
» at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
[jobSchedulerBwcCluster0-0] cluster-manager not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover cluster-manager-eligible nodes [jobSchedulerBwcCluster0-0, jobSchedulerBwcCluster0-1, jobSchedulerBwcCluster0-2] to bootstrap a cluster: have discovered [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{YmkyQHPdS1KWJmMpVKxTiA}{127.0.0.1}{127.0.0.1:33145}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]; discovery will continue using [] from hosts providers and [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{YmkyQHPdS1KWJmMpVKxTiA}{127.0.0.1}{127.0.0.1:33145}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
» WARN ][o.o.c.c.ClusterFormationFailureHelper] [jobSchedulerBwcCluster0-0] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [K4KL68aXRbWpqXO3eNH_Cg, E0bS9c3KTRumdXQmDzZdLw, 9SouJ8iLTEyiTyEwwoHp8w], have discovered [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{uBo3iH55TOyBFimNH6tZ7g}{127.0.0.1}{127.0.0.1:36763}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}] which is not a quorum; discovery will continue using [[::1]:36639, 127.0.0.1:35113, [::1]:46143, 127.0.0.1:41555] from hosts providers and [{jobSchedulerBwcCluster0-0}{9SouJ8iLTEyiTyEwwoHp8w}{uBo3iH55TOyBFimNH6tZ7g}{127.0.0.1}{127.0.0.1:36763}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 3, last-accepted version 4 in term 3
» ↓ repeated 6 times ↓
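The "not a quorum" message follows directly from the majority rule over the voting configuration: with three cluster-manager-eligible node IDs, at least two must be discovered before an election can proceed, and the node above has only discovered itself. Schematically (an illustrative sketch, not the actual coordination code):

```python
def is_election_quorum(discovered_ids: set, voting_config_ids: set) -> bool:
    # An election requires a strict majority of the voting-configuration IDs
    # to be discovered and reachable.
    found = len(discovered_ids & voting_config_ids)
    return 2 * found > len(voting_config_ids)
```

So until the upgraded node can connect to at least one more of the three voting nodes, it keeps logging this warning and cannot elect a cluster manager.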
@cliu123 The BWC tests for OpenSearch itself seem to have no issues, and from the logs it's not clear what is going on. k-NN fails on the rolling-upgrade tests (:qa:rolling-upgrade:knnBwcCluster-rolling) while the security plugin fails on the mixed-cluster tests (:securityBwcCluster#mixedClusterTask), with different errors. The underlying cause may be the same, but it's not obvious.
I tried to reproduce the issue locally from the branch https://github.com/cliu123/security/tree/bump_version_to_3.0.0.0, but the build fails. Could you rebase the branch so we can build and reproduce the errors locally?
This is also blocking the work of AD Extension and Job Scheduler.
Do we have a branch for AD and/or Job Scheduler that we can use to reproduce the issue locally?
We can just update the main branch of Job Scheduler to 3.0.0-SNAPSHOT and reproduce this issue.
For job-scheduler I'm getting a dependency version conflict error with OpenSearch 3.0.0-SNAPSHOT:
Execution failed for task ':compileTestJava'.
> Could not resolve all dependencies for configuration ':testCompileClasspath'.
> Conflict(s) found for the following module(s):
- commons-codec:commons-codec between versions 1.15 and 1.13
Run with:
--scan or
:dependencyInsight --configuration testCompileClasspath --dependency commons-codec:commons-codec
to get more insight on how to solve the conflict.
Related issue - https://github.com/opensearch-project/OpenSearch/issues/4336
That's weird. For job-scheduler I'm getting a BWC test failure on main after updating OpenSearch to 3.0.0-SNAPSHOT:
What went wrong:
Execution failed for task ':opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask'.
> process was found dead while waiting for ports files, node{:opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster1-0}
* Try:
> Run with --scan to get full insights.
* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask'.
at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.lambda$executeIfValid$1(ExecuteActionsTaskExecuter.java:147)
at org.gradle.internal.Try$Failure.ifSuccessfulOrElse(Try.java:282)
at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeIfValid(ExecuteActionsTaskExecuter.java:145)
at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:133)
at org.gradle.api.internal.tasks.execution.CleanupStaleOutputsExecuter.execute(CleanupStaleOutputsExecuter.java:77)
at org.gradle.api.internal.tasks.execution.FinalizePropertiesTaskExecuter.execute(FinalizePropertiesTaskExecuter.java:46)
Issue with my local gradle cache, now able to reproduce