
[CI] SearchableSnapshotsCanMatchOnCoordinatorIntegTests testSearchableSnapshotShardsThatHaveMatchingDataAreNotSkippedOnTheCoordinatingNode failing

Open gmarouli opened this issue 1 year ago • 4 comments

This test doesn't fail often, but it has failed 8 times in the 90 days of history I checked, the first time on the 27th of December 2023.

Build scan: https://gradle-enterprise.elastic.co/s/37emgjkj2kabs/tests/:x-pack:plugin:searchable-snapshots:internalClusterTest/org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsCanMatchOnCoordinatorIntegTests/testSearchableSnapshotShardsThatHaveMatchingDataAreNotSkippedOnTheCoordinatingNode

Reproduction line:

./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsCanMatchOnCoordinatorIntegTests.testSearchableSnapshotShardsThatHaveMatchingDataAreNotSkippedOnTheCoordinatingNode" -Dtests.seed=DF618ECA1E13C7B8 -Dtests.locale=ko -Dtests.timezone=Canada/Saskatchewan -Druntime.java=21

Applicable branches: main

Reproduces locally?: Didn't try

Failure history: Failure dashboard for org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsCanMatchOnCoordinatorIntegTests#testSearchableSnapshotShardsThatHaveMatchingDataAreNotSkippedOnTheCoordinatingNode

Failure excerpt:

java.lang.AssertionError: no shard should be marked as skipped

  at __randomizedtesting.SeedInfo.seed([DF618ECA1E13C7B8:1C94DFDF97D14973]:0)
  at org.junit.Assert.fail(Assert.java:89)
  at org.junit.Assert.assertTrue(Assert.java:42)
  at org.junit.Assert.assertFalse(Assert.java:65)
  at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsCanMatchOnCoordinatorIntegTests.testSearchableSnapshotShardsThatHaveMatchingDataAreNotSkippedOnTheCoordinatingNode(SearchableSnapshotsCanMatchOnCoordinatorIntegTests.java:667)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

gmarouli avatar Feb 09 '24 11:02 gmarouli

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar Feb 09 '24 11:02 elasticsearchmachine

I've muted this in https://github.com/elastic/elasticsearch/commit/4b5f0a0bb7eac6f807876a96bd5a350c505907b8.

rjernst avatar Feb 12 '24 16:02 rjernst

The failing SearchShards API assertion was added in #97212, so I'm going to reassign the test failure to the Search team.

arteam avatar Feb 15 '24 19:02 arteam

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Feb 15 '24 19:02 elasticsearchmachine

A few observations from the time I spent debugging this test:

  • The repository is blocked before the searchable snapshot is mounted, and the blocking flag is read afterwards. Thread synchronization looks correct to me.
  • The search requests (the first and then the second) can run concurrently with the snapshot mount operation, which runs in the background.
  • The mount operation cannot complete because the repository is blocked, so both search requests are expected to fail.
  • Nevertheless, this condition looks suspicious:
        {
            SearchShardsResponse searchShardsResponse = null;
            try {
                searchShardsResponse = client().execute(TransportSearchShardsAction.TYPE, searchShardsRequest).actionGet();
            } catch (SearchPhaseExecutionException e) {
                // ignore as this is expected to happen
            }
            if (searchShardsResponse != null) {
                for (SearchShardsGroup group : searchShardsResponse.getGroups()) {
                    assertFalse("no shard should be marked as skipped", group.skipped());
                }
            }
        }

Somehow a searchShardsResponse was returned, and in that case the test expects it to contain no skipped groups. I'd like the ES-Search team to explain how and why searchShardsResponse can contain results even though the underlying shard(s) do not exist at the time of the query.
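
To make the three possible outcomes of that block explicit, here is a minimal, self-contained sketch of its control flow. It does not use the Elasticsearch classes: `Group`, `Response`, `ShardsCall`, and `Outcome` are hypothetical stand-ins invented for illustration, and `IllegalStateException` stands in for `SearchPhaseExecutionException`. Only the third outcome corresponds to the reported `AssertionError`.

```java
import java.util.List;

public class SkippedShardCheckSketch {
    // Hypothetical stand-in for SearchShardsGroup: only the skipped flag matters here.
    record Group(boolean skipped) {}

    // Hypothetical stand-in for SearchShardsResponse.
    record Response(List<Group> groups) {}

    // Hypothetical stand-in for the client call: it either returns a response or throws.
    interface ShardsCall {
        Response execute();
    }

    enum Outcome { REQUEST_FAILED, NO_SKIPPED_GROUPS, SKIPPED_GROUP_FOUND }

    static Outcome check(ShardsCall call) {
        Response response;
        try {
            response = call.execute();
        } catch (IllegalStateException e) {
            // Stand-in for SearchPhaseExecutionException: the test ignores it as expected,
            // so nothing is asserted on this path.
            return Outcome.REQUEST_FAILED;
        }
        for (Group group : response.groups()) {
            if (group.skipped()) {
                // The path on which the real test's assertFalse fails.
                return Outcome.SKIPPED_GROUP_FOUND;
            }
        }
        return Outcome.NO_SKIPPED_GROUPS;
    }

    public static void main(String[] args) {
        // The request fails while the mount is blocked: the test passes without asserting.
        System.out.println(check(() -> { throw new IllegalStateException("shards unavailable"); }));
        // A response arrives and no group is skipped: the test passes.
        System.out.println(check(() -> new Response(List.of(new Group(false)))));
        // A response arrives with a skipped group: the case seen in the failure.
        System.out.println(check(() -> new Response(List.of(new Group(true)))));
    }
}
```

The sketch only shows that the assertion is reachable whenever the request happens to succeed; it does not explain why a successful response would carry a skipped group.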

volodk85 avatar Feb 23 '24 22:02 volodk85

Possibly related failure: https://gradle-enterprise.elastic.co/s/5o5anthxsfi6k/tests/task/:x-pack:plugin:searchable-snapshots:internalClusterTest/details/org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsCanMatchOnCoordinatorIntegTests/testSearchableSnapshotShardsThatHaveMatchingDataAreNotSkippedOnTheCoordinatingNode?page=eyJvdXRwdXQiOnsiMCI6MX19&top-execution=1

The test is the same, but the exception is different. Let me know if it's worth creating a separate issue for it.

ldematte avatar Mar 22 '24 17:03 ldematte

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine avatar Jul 17 '24 18:07 elasticsearchmachine