[CI] elasticsearch-ci/7.17.22 / bwc-snapshots-windows fails

Open ldematte opened this issue 9 months ago • 3 comments

CI Link

https://gradle-enterprise.elastic.co/s/au3yen3ihluxs

Repro line

N/A

Does it reproduce?

Didn't try locally, but it seems to fail pretty reliably (at least on my PR).

Applicable branches

7.17

Failure history

No response

Failure excerpt

The failure message is:

Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest'.
> process was found dead while waiting for ports files, node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.22-local-0}

The root cause seems to be this:

[2024-05-09T16:15:33.909848600Z] [BUILD] Starting Elasticsearch process	
»  May 09, 2024 4:15:40 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>	
»  WARNING: COMPAT locale provider will be removed in a future release	
»   ↑ repeated 2 times ↑	
» [2024-05-09T16:15:58,077][ERROR][o.e.b.Elasticsearch      ] [v7.17.22-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: Failed to bind service	
»  	at [email protected]/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:283)	
»  	at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:192)	
»  	at [email protected]/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:240)	
»  	at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:240)	
»  	at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:75)	
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.15.0] is only supported from version [7.17.0].	
»  	at [email protected]/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:517)	
»  	at [email protected]/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:416)	
»  	at [email protected]/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:309)	
»  	at [email protected]/org.elasticsearch.node.NodeConstruction.validateSettings(NodeConstruction.java:511)	
»  	at [email protected]/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:258)	
»  	... 4 more	
»  	
»  ERROR: Elasticsearch did not exit normally - check the logs at C:\bk\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\logs\v7.17.22-local.log

Tagging both Core/Infra and Distributed, as it could be a version compatibility issue or a persisted cluster state issue - the comment line above the error says:

// We are upgrading the cluster, but we didn't find any previous metadata. Corrupted state or incompatible version.

Curiously, this seems to happen on Windows only?

ldematte avatar May 09 '24 17:05 ldematte

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine avatar May 09 '24 17:05 elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar May 09 '24 17:05 elasticsearchmachine

Lacking a better alternative, here is an "empty" PR on main that shows the issue: https://github.com/elastic/elasticsearch/pull/108490

ldematte avatar May 10 '24 06:05 ldematte

Created an instance using agent-instance.sh, hardcoding it here to use branch main.

Ran the failing test directly with ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest (inside bash).

prdoyle avatar Jul 16 '24 17:07 prdoyle

That passed. I can try again with SHA 9f757fb17f43ffeb8ebf1ee1b0b85b3f99d99b40

prdoyle avatar Jul 16 '24 17:07 prdoyle

Just noticed I'm using a different Random Testing Seed. I should probably fix that.

Also, a note: the download of Gradle takes a few minutes, and then the build takes 11+ minutes, before the test run even starts.

prdoyle avatar Jul 16 '24 17:07 prdoyle

I'm not sure what to make of this result:

 [7.17.22] BUILD SUCCESSFUL in 7m 48s
 [7.17.22] 467 actionable tasks: 467 executed

> Task :distribution:bwc:maintenance:buildBwcWindowsZip FAILED
Build Finished Action: Collecting archive files...

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':distribution:bwc:maintenance:buildBwcWindowsZip'.
> Building 7.17.22 didn't generate expected artifact [distribution\bwc\maintenance\build\bwc\checkout-7.17\distribution\archives\windows-zip\build\install\elasticsearch-7.17.22-SNAPSHOT]. The working branch may be out-of-date - try merging in the latest upstream changes to the branch.

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org.

BUILD FAILED in 13m 55s
585 actionable tasks: 585 executed

I don't think that's the same failure.

prdoyle avatar Jul 16 '24 18:07 prdoyle

Ok I got a failure with -Dtests.seed=3B3EB561961A02F3. Doesn't really look like the same problem but it's hard to tell.

It's a little hard to read because of the text rendering on Windows.

> Configure project :x-pack:qa:repository-old-versions
Disabling repository-old-versions tests because we can't get the pid file on windows
=======================================
Elasticsearch Build Hamster says Hello!
  Gradle Version        : 8.7
  OS Info               : Windows Server 2022 10.0 (amd64)
  JDK Version           : 17.0.2+8-86 (Oracle)
  JAVA_HOME             : C:\Users\buildkite\.java\openjdk17
  Random Testing Seed   : 3B3EB561961A02F3
  In FIPS 140 mode      : false
=======================================

> Task :distribution:bwc:maintenance:checkoutBwcBranch
Performing checkout of elastic/7.17...
Checkout hash for :distribution:bwc:maintenance is c364c6017c1b8156f3e66e7d1993b4a98810a2ce

> Task :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest FAILED

=== Log output of node `node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.22-local-0}` ===

┬╗    Γåô errors and warnings from C:\Users\buildkite\dev\elasticsearch\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\logs\es.out Γåô
┬╗ [2024-07-16T18:13:46.998947700Z] [BUILD] Starting Elasticsearch process
┬╗  Jul 16, 2024 6:13:51 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
┬╗  WARNING: COMPAT locale provider will be removed in a future release
┬╗ [2024-07-16T18:15:05,399][WARN ][o.e.d.FileBasedSeedHostsProvider] [v7.17.22-local-0] expected, but did not find, a dynamic hosts list at [C:\Users\buildkite\dev\elasticsearch\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\config\unicast_hosts.txt]
┬╗   Γåæ repeated 46 times Γåæ
┬╗ [2024-07-16T18:15:00,148][WARN ][o.e.c.c.ClusterFormationFailureHelper] [v7.17.22-local-0] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [v7.17.22-local-0, v7.17.22-local-1] to bootstrap a cluster: have discovered [{v7.17.22-local-0}{0mA6GJbuQGSXmenGjc5wMg}{aOxNRpo9Tpq29hUNWEykvw}{127.0.0.1}{127.0.0.1:51041}{cdfhilmrstw}]; discovery will continue using [] from hosts providers and [{v7.17.22-local-0}{0mA6GJbuQGSXmenGjc5wMg}{aOxNRpo9Tpq29hUNWEykvw}{127.0.0.1}{127.0.0.1:51041}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
┬╗   Γåæ repeated 4 times Γåæ
┬╗   Γåô last 40 non error or warning messages from C:\Users\buildkite\dev\elasticsearch\qa\ccs-rolling-upgrade-remote-cluster\build\testclusters\v7.17.22-local-0\logs\es.out Γåô
┬╗ [2024-07-16T18:15:05.953157900Z] [BUILD] Stopping node
Build Finished Action: Collecting archive files...

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest'.
> Failed to create working directory for node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.22-local-1}, with: java.io.IOExceptionjava.io.IOException
        at org.elasticsearch.gradle.testclusters.ElasticsearchNode.deleteWithRetry0(ElasticsearchNode.java:1229)
        at org.elasticsearch.gradle.testclusters.ElasticsearchNode.deleteWithRetry(ElasticsearchNode.java:1183)
        at org.elasticsearch.gradle.testclusters.ElasticsearchNode.start(ElasticsearchNode.java:465)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at org.elasticsearch.gradle.testclusters.ElasticsearchCluster.start(ElasticsearchCluster.java:421)
        at org.elasticsearch.gradle.testclusters.TestClustersRegistry.maybeStartCluster(TestClustersRegistry.java:42)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at org.elasticsearch.gradle.testclusters.TestClustersPlugin$TestClustersHookPlugin.lambda$configureStartClustersHook$7(TestClustersPlugin.java:244)
        at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:831)
        at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:804)

prdoyle avatar Jul 16 '24 18:07 prdoyle

I notice that my run says this:

Checkout hash for :distribution:bwc:maintenance is c364c6017c1b8156f3e66e7d1993b4a98810a2ce

The original says this:

Checkout hash for :distribution:bwc:maintenance is 2a327b1125d0248ebc813b71dc85d4cf1a55f08f

Not sure whether this could account for the different failure mode.
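For comparing two runs, a minimal sketch of pulling the checkout hashes out of the build output and listing the projects where they differ. The helper names `checkout_hashes` and `differing_projects` are made up for this illustration; the only thing taken from the logs is the "Checkout hash for ... is ..." line format.

```python
import re

# Matches lines such as:
#   Checkout hash for :distribution:bwc:maintenance is <40-char sha>
CHECKOUT_RE = re.compile(r"Checkout hash for (\S+) is ([0-9a-f]{40})")


def checkout_hashes(log_text):
    """Map each project path found in the log to its checkout SHA."""
    return {project: sha for project, sha in CHECKOUT_RE.findall(log_text)}


def differing_projects(a, b):
    """Projects present in both runs whose checkout SHA differs."""
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])


# The two lines quoted above, used as sample input.
my_run = (
    "Checkout hash for :distribution:bwc:maintenance is "
    "c364c6017c1b8156f3e66e7d1993b4a98810a2ce"
)
original_run = (
    "Checkout hash for :distribution:bwc:maintenance is "
    "2a327b1125d0248ebc813b71dc85d4cf1a55f08f"
)

diff = differing_projects(checkout_hashes(my_run), checkout_hashes(original_run))
print(diff)
```

A differing hash only shows that the two runs built different 7.17 commits; it does not by itself explain the different failure mode.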

prdoyle avatar Jul 16 '24 18:07 prdoyle

I'm trying a re-run with Buildkite here: https://buildkite.com/elastic/elasticsearch-pull-request/builds/26040. I specified JAVA_TOOL_OPTIONS=-Dtests.seed=3B3EB561961A02F3

prdoyle avatar Jul 16 '24 18:07 prdoyle

Trying again with branch origin:main here: https://buildkite.com/elastic/elasticsearch-pull-request/builds/26041

prdoyle avatar Jul 16 '24 18:07 prdoyle

I don't understand what Buildkite wants for the branch name. 🤔

prdoyle avatar Jul 16 '24 18:07 prdoyle

Opened a dummy PR with test-windows label to trigger a build, which is here.

prdoyle avatar Jul 16 '24 19:07 prdoyle

Making a new agent instance from this build step.

prdoyle avatar Jul 17 '24 12:07 prdoyle

Running this in bash: JAVA_TOOL_OPTIONS=-Dtests.seed=3B3EB561961A02F3 ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest

prdoyle avatar Jul 17 '24 12:07 prdoyle

My run here does not contain any of these strings:

  • process was found dead
  • Failed to bind service
  • any previous metadata

The test did fail though, with a different symptom:

  | VectorSystemPropertyTests > testSystemPropertyDisabled FAILED
  | java.security.AccessControlException: access denied ("java.io.FilePermission" "C:\Users\buildkite\.gradle\jdks\oracle_corporation-22-amd64-windows\jdk-22.0.1\bin\java" "execute")
  | at __randomizedtesting.SeedInfo.seed([CF3F79C9F5F89BB7:BAFB1329C6381DF2]:0)
  | at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:488)
  | at java.base/java.security.AccessController.checkPermission(AccessController.java:1085)
  | at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:411)
  | at java.base/java.lang.SecurityManager.checkExec(SecurityManager.java:650)
  | at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1115)
  | at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
  | at org.elasticsearch.nativeaccess.VectorSystemPropertyTests.testSystemPropertyDisabled(VectorSystemPropertyTests.java:54)
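The string check mentioned above can be sketched as follows. This is only an illustration: the function name `find_markers` is hypothetical, and the sample text is a shortened line from the original failure excerpt, not output from this run.

```python
# Marker strings from the original failure report.
MARKERS = (
    "process was found dead",
    "Failed to bind service",
    "any previous metadata",
)


def find_markers(log_text: str) -> dict:
    """Report, for each marker, whether it occurs anywhere in the log text."""
    return {marker: marker in log_text for marker in MARKERS}


# Shortened sample taken from the original excerpt, for demonstration only.
sample = (
    "fatal exception while booting Elasticsearch "
    "org.elasticsearch.ElasticsearchException: Failed to bind service"
)
result = find_markers(sample)
print(result)
```

If every value is False, the run did not hit the original symptom, which is what I saw here.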

prdoyle avatar Jul 17 '24 13:07 prdoyle

Relevant comment here. I think the build I'm looking at is not the most relevant.

Hmm I'm a bit confused.

This build, which contains errors, is based on SHA 154a78dfe2ba, which I think isn't the most relevant. I'll have to double check where that one came from.

This build contains no errors, and is based on SHA 02f96060cae4, which is the one I meant to test, based on the latest main.

Based on that latter build, we might be able to conclude that this failure is no longer occurring. Perhaps it was already fixed upstream.

Let me try one more time with an agent-instance and the original random seed, just to verify.

prdoyle avatar Jul 17 '24 14:07 prdoyle

See https://github.com/elastic/elasticsearch/issues/110949. The vector failure you see should be muted now. I suggest updating and trying again.

rjernst avatar Jul 17 '24 14:07 rjernst

My run passed.

Command: JAVA_TOOL_OPTIONS=-Dtests.seed=3B3EB561961A02F3 ./gradlew :qa:ccs-rolling-upgrade-remote-cluster:v7.17.22#oldClusterTest

=======================================
Elasticsearch Build Hamster says Hello!
  Gradle Version        : 8.8
  OS Info               : Windows Server 2022 10.0 (amd64)
  JDK Version           : 17.0.2+8-86 (Oracle)
  JAVA_HOME             : C:\Users\buildkite\.java\openjdk17
  Random Testing Seed   : 3B3EB561961A02F3
  In FIPS 140 mode      : false
=======================================

prdoyle avatar Jul 17 '24 14:07 prdoyle

Oh ok. I've merged main. Trying again.

prdoyle avatar Jul 17 '24 14:07 prdoyle

@rjernst - I think this means the VectorSystemPropertyTests failure is ignorable... would it interfere with the test I'm trying to run? 🤔

prdoyle avatar Jul 17 '24 15:07 prdoyle

I investigated this issue several weeks ago as I was seeing it in one of my PRs (https://github.com/elastic/elasticsearch/pull/108970, which ran full bwc tests).

The root of the issue had nothing to do with the current main code; it was caused by the bwc test structure. Normally, REST tests are run in each phase of the bwc tests. But this particular ccs test starts up a node that is only intended to advance to the current version. Since no REST tests are present, nothing waits for the node to actually complete startup, and it is killed before it writes out state with the current version. However, the node gets far enough that it writes out a lock file. When the newer node starts up, it sees the lock file, then looks for state. The error about mismatched versions is actually the result of the state file being missing (the error message is bogus). Dave Turner filed an issue to make the startup code more resilient to this situation: https://github.com/elastic/elasticsearch/issues/109544.
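To make the sequence concrete, here is a toy model of it. It is not Elasticsearch's actual startup code: the file names `node.lock` and `state.json` and the function `check_data_dir` are assumptions made up for the illustration. It only shows how "lock file present, state missing" is a distinct case from both a fresh directory and a healthy one.

```python
import tempfile
from pathlib import Path

# Hypothetical file names, chosen only for this illustration.
LOCK_FILE = "node.lock"
STATE_FILE = "state.json"


def check_data_dir(data_dir: Path) -> str:
    """Classify a data directory the way the description above suggests."""
    has_lock = (data_dir / LOCK_FILE).exists()
    has_state = (data_dir / STATE_FILE).exists()
    if not has_lock:
        return "fresh"          # no node has used this directory yet
    if has_state:
        return "ok"             # a previous node finished writing its state
    return "missing-state"      # a previous node was killed mid-startup


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    before = check_data_dir(data_dir)

    # The first node gets as far as the lock file, then is killed.
    (data_dir / LOCK_FILE).touch()
    after_kill = check_data_dir(data_dir)

    # Had it been allowed to finish, state would have been written too.
    (data_dir / STATE_FILE).write_text("{}")
    after_state = check_data_dir(data_dir)

print(before, after_kill, after_state)
```

In the real failure, the "missing-state" case surfaced as a version mismatch error rather than as a clear "state is missing" message.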

I believe my PR actually fixed the test issue. See my workaround, which forces the test to wait until the node is fully started up before proceeding: https://github.com/elastic/elasticsearch/pull/108970/files#diff-42d5e706e1699b8aabe8bbaf2f15cc657ed136f1e66e9905215114d12fe93be1R61
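The general idea of "wait until the node is fully started" can be sketched as a poll with a timeout. This is not the code from the PR (see the link for that); `wait_until` and `fake_node_ready` are hypothetical names, and the readiness check is a stand-in that reports ready on its third poll.

```python
import time


def wait_until(is_ready, timeout_s=30.0, interval_s=0.5):
    """Poll is_ready() until it returns True or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for a real readiness check: reports ready on the third poll.
polls = {"count": 0}


def fake_node_ready():
    polls["count"] += 1
    return polls["count"] >= 3


started = wait_until(fake_node_ready, timeout_s=5.0, interval_s=0.01)
print(started, polls["count"])
```

The point is only that something has to block on node startup before the next phase begins; otherwise the node can be killed between writing the lock file and writing its state.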

rjernst avatar Jul 17 '24 16:07 rjernst

Alright thanks @rjernst. I'll close it.

prdoyle avatar Jul 18 '24 00:07 prdoyle