[CI] elasticsearch-ci/7.17.22 / bwc-snapshots-windows fails
CI Link
https://gradle-enterprise.elastic.co/s/au3yen3ihluxs
Repro line
N/A
Does it reproduce?
Didn't try locally, but it seems to fail pretty reliably (at least on my PR).
Applicable branches
7.17
Failure history
No response
Failure excerpt
The failure message is:
Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest'.
> process was found dead while waiting for ports files, node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.22-local-0}
The root cause seems this:
[2024-05-09T16:15:33.909848600Z] [BUILD] Starting Elasticsearch process
» May 09, 2024 4:15:40 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
» WARNING: COMPAT locale provider will be removed in a future release
» ↑ repeated 2 times ↑
» [2024-05-09T16:15:58,077][ERROR][o.e.b.Elasticsearch ] [v7.17.22-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: Failed to bind service
» at [email protected]/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:283)
» at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:192)
» at [email protected]/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:240)
» at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:240)
» at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:75)
» Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.15.0] is only supported from version [7.17.0].
» at [email protected]/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:517)
» at [email protected]/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:416)
» at [email protected]/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:309)
» at [email protected]/org.elasticsearch.node.NodeConstruction.validateSettings(NodeConstruction.java:511)
» at [email protected]/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:258)
» ... 4 more
»
» ERROR: Elasticsearch did not exit normally - check the logs at C:\bk\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\logs\v7.17.22-local.log
Tagging both Core/Infra and Distributed, as it could be a version compatibility issue or a persisted cluster state issue - the comment line above the error says:
// We are upgrading the cluster, but we didn't find any previous metadata. Corrupted state or incompatible version.
Curiously, this seems to happen on Windows only?
Pinging @elastic/es-core-infra (Team:Core/Infra)
Pinging @elastic/es-distributed (Team:Distributed)
Lacking a better alternative, here is an "empty" PR on main that shows the issue: https://github.com/elastic/elasticsearch/pull/108490
Created an instance using agent-instance.sh and hardcoded it to use branch main.
Ran the failing test directly with ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest (inside bash).
That passed. I can try again with SHA 9f757fb17f43ffeb8ebf1ee1b0b85b3f99d99b40
Just noticed I'm using a different Random Testing Seed. I should probably fix that.
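For reference, the seed can be pinned on a local rerun by passing it as a system property; the exact seed from the original CI run isn't recorded in this issue, so the value below is a placeholder (later comments in this thread use the same JAVA_TOOL_OPTIONS mechanism):
JAVA_TOOL_OPTIONS=-Dtests.seed=<SEED_FROM_CI_RUN> ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest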
Also, a note: downloading Gradle takes a few minutes, and then the build takes 11+ minutes before the tests even start.
I'm not sure what to make of this result:
[7.17.22] BUILD SUCCESSFUL in 7m 48s
[7.17.22] 467 actionable tasks: 467 executed
> Task :distribution:bwc:maintenance:buildBwcWindowsZip FAILED
Build Finished Action: Collecting archive files...
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':distribution:bwc:maintenance:buildBwcWindowsZip'.
> Building 7.17.22 didn't generate expected artifact [distribution\bwc\maintenance\build\bwc\checkout-7.17\distribution\archives\windows-zip\build\install\elasticsearch-7.17.22-SNAPSHOT]. The working branch may be out-of-date - try merging in the latest upstream changes to the branch.
* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org.
BUILD FAILED in 13m 55s
585 actionable tasks: 585 executed
I don't think that's the same failure.
Ok, I got a failure with -Dtests.seed=3B3EB561961A02F3. It doesn't really look like the same problem, but it's hard to tell.
It's a little hard to read because of the text rendering on Windows.
> Configure project :x-pack:qa:repository-old-versions
Disabling repository-old-versions tests because we can't get the pid file on windows
=======================================
Elasticsearch Build Hamster says Hello!
Gradle Version : 8.7
OS Info : Windows Server 2022 10.0 (amd64)
JDK Version : 17.0.2+8-86 (Oracle)
JAVA_HOME : C:\Users\buildkite\.java\openjdk17
Random Testing Seed : 3B3EB561961A02F3
In FIPS 140 mode : false
=======================================
> Task :distribution:bwc:maintenance:checkoutBwcBranch
Performing checkout of elastic/7.17...
Checkout hash for :distribution:bwc:maintenance is c364c6017c1b8156f3e66e7d1993b4a98810a2ce
> Task :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest FAILED
=== Log output of node `node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.22-local-0}` ===
» ↓ errors and warnings from C:\Users\buildkite\dev\elasticsearch\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\logs\es.out ↓
» [2024-07-16T18:13:46.998947700Z] [BUILD] Starting Elasticsearch process
» Jul 16, 2024 6:13:51 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
» WARNING: COMPAT locale provider will be removed in a future release
» [2024-07-16T18:15:05,399][WARN ][o.e.d.FileBasedSeedHostsProvider] [v7.17.22-local-0] expected, but did not find, a dynamic hosts list at [C:\Users\buildkite\dev\elasticsearch\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\config\unicast_hosts.txt]
» ↑ repeated 46 times ↑
» [2024-07-16T18:15:00,148][WARN ][o.e.c.c.ClusterFormationFailureHelper] [v7.17.22-local-0] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [v7.17.22-local-0, v7.17.22-local-1] to bootstrap a cluster: have discovered [{v7.17.22-local-0}{0mA6GJbuQGSXmenGjc5wMg}{aOxNRpo9Tpq29hUNWEykvw}{127.0.0.1}{127.0.0.1:51041}{cdfhilmrstw}]; discovery will continue using [] from hosts providers and [{v7.17.22-local-0}{0mA6GJbuQGSXmenGjc5wMg}{aOxNRpo9Tpq29hUNWEykvw}{127.0.0.1}{127.0.0.1:51041}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
» ↑ repeated 4 times ↑
» ↓ last 40 non error or warning messages from C:\Users\buildkite\dev\elasticsearch\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\logs\es.out ↓
» [2024-07-16T18:15:05.953157900Z] [BUILD] Stopping node
Build Finished Action: Collecting archive files...
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest'.
> Failed to create working directory for node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.22-local-1}, with: java.io.IOExceptionjava.io.IOException
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.deleteWithRetry0(ElasticsearchNode.java:1229)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.deleteWithRetry(ElasticsearchNode.java:1183)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.start(ElasticsearchNode.java:465)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at org.elasticsearch.gradle.testclusters.ElasticsearchCluster.start(ElasticsearchCluster.java:421)
at org.elasticsearch.gradle.testclusters.TestClustersRegistry.maybeStartCluster(TestClustersRegistry.java:42)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at org.elasticsearch.gradle.testclusters.TestClustersPlugin$TestClustersHookPlugin.lambda$configureStartClustersHook$7(TestClustersPlugin.java:244)
at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:831)
at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:804)
I notice that my run says this:
Checkout hash for :distribution:bwc:maintenance is c364c6017c1b8156f3e66e7d1993b4a98810a2ce
The original says this:
Checkout hash for :distribution:bwc:maintenance is 2a327b1125d0248ebc813b71dc85d4cf1a55f08f
Not sure whether this could account for the different failure mode.
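Since both hashes come from the same :distribution:bwc:maintenance checkout of the 7.17 branch, one way to see what changed between the two runs is to list the commit range directly in that checkout. A sketch only, assuming the checkout directory from the earlier error message and that the newer hash is a descendant of the older one:
git -C distribution/bwc/maintenance/build/bwc/checkout-7.17 log --oneline 2a327b1125d0248ebc813b71dc85d4cf1a55f08f..c364c6017c1b8156f3e66e7d1993b4a98810a2ce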
I'm trying a re-run with Buildkite here: https://buildkite.com/elastic/elasticsearch-pull-request/builds/26040. I specified JAVA_TOOL_OPTIONS=-Dtests.seed=3B3EB561961A02F3
Trying again with branch origin:main here: https://buildkite.com/elastic/elasticsearch-pull-request/builds/26041
I don't understand what Buildkite wants for the branch name. 🤔
Making a new agent instance from this build step.
Running this in bash: JAVA_TOOL_OPTIONS=-Dtests.seed=3B3EB561961A02F3 ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest
My run here does not contain any of these strings (one way to scan for them is sketched after this list):
- process was found dead
- Failed to bind service
- any previous metadata
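A quick way to scan for those strings, as a sketch: this assumes bash and that the relevant output ends up under the test cluster logs directory shown in the excerpts above (the Gradle console output would need to be checked separately):
grep -rE "process was found dead|Failed to bind service|any previous metadata" qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/*/logs/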
The test did fail though, with a different symptom:
| VectorSystemPropertyTests > testSystemPropertyDisabled FAILED
| java.security.AccessControlException: access denied ("java.io.FilePermission" "C:\Users\buildkite\.gradle\jdks\oracle_corporation-22-amd64-windows\jdk-22.0.1\bin\java" "execute")
| at __randomizedtesting.SeedInfo.seed([CF3F79C9F5F89BB7:BAFB1329C6381DF2]:0)
| at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488)
| at java.base/java.security.AccessController.checkPermission(AccessController.java:1085)
| at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411)
| at java.base/java.lang.SecurityManager.checkExec(SecurityManager.java:650)
| at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1115)
| at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
| at org.elasticsearch.nativeaccess.VectorSystemPropertyTests.testSystemPropertyDisabled(VectorSystemPropertyTests.java:54)
Relevant comment here. I think the build I'm looking at is not the most relevant.
Hmm I'm a bit confused.
This build, which contains errors, is based on SHA 154a78dfe2ba, which I think isn't the most relevant. I'll have to double check where that one came from.
This build contains no errors, and is based on SHA 02f96060cae4, which is the one I meant to test, based on the latest main. Based on that latter build, we might be able to conclude that this failure is no longer occurring. Perhaps it was already fixed upstream.
Let me try one more time with an agent-instance and the original random seed, just to verify.
See https://github.com/elastic/elasticsearch/issues/110949. The vector failure you see should be muted now. I suggest updating and trying again.
My run passed.
Command: JAVA_TOOL_OPTIONS=-Dtests.seed=3B3EB561961A02F3 ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest
=======================================
Elasticsearch Build Hamster says Hello!
Gradle Version : 8.8
OS Info : Windows Server 2022 10.0 (amd64)
JDK Version : 17.0.2+8-86 (Oracle)
JAVA_HOME : C:\Users\buildkite\.java\openjdk17
Random Testing Seed : 3B3EB561961A02F3
In FIPS 140 mode : false
=======================================
Oh ok. I've merged main. Trying again.
@rjernst - I think this means the VectorSystemPropertyTests failure is ignorable... would it interfere with the test I'm trying to run? 🤔
I investigated this issue several weeks ago as I was seeing it in one of my PRs (https://github.com/elastic/elasticsearch/pull/108970, which ran full bwc tests).
The root of the issue had nothing to do with the current main code, but rather with the bwc test structure. Normally REST tests run in each phase of the bwc tests, but this particular CCS test starts up a node that is only intended to advance to the current version. Since no REST tests are present, nothing waits for the node to actually complete startup, and it is killed before it writes out state with the current version. However, the node gets far enough to write out a lock file. When the newer node starts up, it sees the lock file and then looks for state. The error about mismatched versions is actually the result of the state file being missing (the error message is bogus).
Dave Turner filed an issue to make the startup code more resilient to this situation: https://github.com/elastic/elasticsearch/issues/109544.
I believe my PR actually fixed the test issue; see my workaround, which forces the test to wait until the node has fully started before proceeding: https://github.com/elastic/elasticsearch/pull/108970/files#diff-42d5e706e1699b8aabe8bbaf2f15cc657ed136f1e66e9905215114d12fe93be1R61
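The same idea, expressed as a standalone shell wait rather than the actual Gradle change in that PR (a sketch; the address and port are illustrative, and the real fix lives in the test setup linked above):
# block until the old node answers cluster health, i.e. it has fully started and persisted its state
until curl -s "http://127.0.0.1:9200/_cluster/health?wait_for_status=yellow&timeout=10s" > /dev/null; do sleep 1; done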
Alright thanks @rjernst. I'll close it.