elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

[CI] RestartIndexFollowingIT testFollowIndex failing

Open kkrik-es opened this issue 10 months ago • 3 comments

Build scan: https://gradle-enterprise.elastic.co/s/4ayosvibial2k/tests/:x-pack:plugin:ccr:internalClusterTest/org.elasticsearch.xpack.ccr.RestartIndexFollowingIT/testFollowIndex

Reproduction line:

./gradlew ':x-pack:plugin:ccr:internalClusterTest' --tests "org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.testFollowIndex" -Dtests.seed=A67E2A734CCC578 -Dtests.locale=sk-SK -Dtests.timezone=America/Managua -Druntime.java=17

Applicable branches: main

Reproduces locally?: Didn't try

Failure history: Failure dashboard for org.elasticsearch.xpack.ccr.RestartIndexFollowingIT#testFollowIndex

Failure excerpt:

java.lang.AssertionError: Count is 143 hits but 152 was expected.  Total shards: 8 Successful shards: 8 & 0 shard failures:

  at __randomizedtesting.SeedInfo.seed([A67E2A734CCC578:E90B2D950D74FCF2]:0)
  at org.junit.Assert.fail(Assert.java:89)
  at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertHitCount(ElasticsearchAssertions.java:309)
  at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.lambda$assertHitCount$5(ElasticsearchAssertions.java:295)
  at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertResponse(ElasticsearchAssertions.java:375)
  at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertHitCount(ElasticsearchAssertions.java:295)
  at org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.lambda$testFollowIndex$1(RestartIndexFollowingIT.java:117)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
  at org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.testFollowIndex(RestartIndexFollowingIT.java:104)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

kkrik-es avatar Apr 16 '24 06:04 kkrik-es

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar Apr 16 '24 06:04 elasticsearchmachine

This is still happening: https://gradle-enterprise.elastic.co/s/qii4jqdjvu32u

kkrik-es avatar May 01 '24 06:05 kkrik-es

I was checking the failures and every time we have a mismatch of docs count it is due to something like:

[2024-03-13T13:33:54,641][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower3] shard follow task encounter non-retryable error
org.elasticsearch.index.IndexNotFoundException: no such index [index1]
	at org.elasticsearch.cluster.metadata.Metadata.getIndexSafe(Metadata.java:1238) ~[elasticsearch-8.12.3-SNAPSHOT.jar:8.12.3-SNAPSHOT]
	at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.lambda$innerUpdateSettings$6(ShardFollowTasksExecutor.java:204) ~[main/:?]

[2024-03-13T13:33:54,643][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower3] shard follow task encounter non-retryable error
org.elasticsearch.index.IndexNotFoundException: no such index [index1]
	at org.elasticsearch.cluster.metadata.Metadata.getIndexSafe(Metadata.java:1238) ~[elasticsearch-8.12.3-SNAPSHOT.jar:8.12.3-SNAPSHOT]
	at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.lambda$innerUpdateAliases$8(ShardFollowTasksExecutor.java:284) ~[main/:?]

[2024-03-13T13:33:54,645][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower3] shard follow task encounter non-retryable error
org.elasticsearch.index.IndexNotFoundException: no such index [index1]
	at org.elasticsearch.cluster.metadata.Metadata.getIndexSafe(Metadata.java:1238) ~[elasticsearch-8.12.3-SNAPSHOT.jar:8.12.3-SNAPSHOT]
	at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.lambda$innerUpdateSettings$6(ShardFollowTasksExecutor.java:204) ~[main/:?]

[2024-03-13T13:33:54,654][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower3] shard follow task encounter non-retryable error
org.elasticsearch.index.IndexNotFoundException: no such index [index1]
	at org.elasticsearch.cluster.metadata.Metadata.getIndexSafe(Metadata.java:1238) ~[elasticsearch-8.12.3-SNAPSHOT.jar:8.12.3-SNAPSHOT]
	at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.lambda$innerUpdateAliases$8(ShardFollowTasksExecutor.java:284) ~[main/:?]

see: https://gradle-enterprise.elastic.co/s/qzwwh6utni65q/tests/task/:x-pack:plugin:ccr:internalClusterTest/details/org.elasticsearch.xpack.ccr.RestartIndexFollowingIT/testFollowIndex/1/output

There is nothing in leader logs indicating that index was deleted. The leader node was restarting in meanwhile. Is there a way we could return incomplete/empty or a cluster state from a node with no master during this period of time?

idegtiarenko avatar May 16 '24 14:05 idegtiarenko

The failure is likely happening as we see the unrecovered cluster state that does not contain the index we follow. This is a bug in ShardFollowNodeTask we need to fix.

idegtiarenko avatar May 17 '24 09:05 idegtiarenko