[CI] RestSqlIT testAsyncTextPaginated failing
Although the same test, this failure differs from #76785, because that was a timeout whereas this one is "all shards failed". I wonder if this is yet another manifestation of #65846?
Build scan: https://gradle-enterprise.elastic.co/s/xehunhrnkfhhi/tests/:x-pack:plugin:sql:qa:server:security:with-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated
Reproduction line:
./gradlew ':x-pack:plugin:sql:qa:server:security:with-ssl:integTest' --tests "org.elasticsearch.xpack.sql.qa.security.RestSqlIT.testAsyncTextPaginated" -Dtests.seed=B1D5C515E1E4486D -Dtests.locale=hr-HR -Dtests.timezone=Asia/Irkutsk -Druntime.java=17
Applicable branches: master, 8.0, 7.16
Reproduces locally?: No
Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.sql.qa.security.RestSqlIT&tests.test=testAsyncTextPaginated
Failure excerpt:
org.elasticsearch.client.ResponseException: method [GET], host [https://[::1]:45613], URI [/_sql/async/FkZhZEVPa2VyUmphUlRzQW5mMWYtX2cbb0U0bzB2UGlSWnVGZ2RRV2ZlU3dnQTo0MDMz?format=txt&wait_for_completion_timeout=1d], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}},"status":404}
at __randomizedtesting.SeedInfo.seed([B1D5C515E1E4486D:FB65C4A44320DED2]:0)
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:335)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:276)
at org.elasticsearch.xpack.sql.qa.rest.RestSqlTestCase.testAsyncTextPaginated(RestSqlTestCase.java:1344)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:568)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
at java.lang.Thread.run(Thread.java:833)
Pinging @elastic/es-ql (Team:QL)
Same thing also happened on 8.0 - see https://gradle-enterprise.elastic.co/s/4sq2l2sm6y5yq/tests/:x-pack:plugin:sql:qa:server:security:without-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated
There are 4 test classes with the name RestSqlIT
- org.elasticsearch.xpack.sql.qa.security.RestSqlIT
- org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.RestSqlIT
- org.elasticsearch.xpack.sql.qa.single_node.RestSqlIT
- org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT
All have had failures in the last 7 days apart from org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT
https://build-stats.elastic.co/app/kibana#/discover?_g=h@33d9122&_a=h@19cd33a
The test is muted in the base class in #80126
The failures won't reproduce locally, but all CI failures contain the warning There are still tasks running after this test that might break subsequent tests (also not reproducible locally) after every test preceding this failing one, which I suppose has to do with the test failing.
The leftover task is indices:data/read/sql/async/get.
What might be happening is:
- The previous test leaves an async search in progress when it completes
- The cleanup that runs between tests deletes all indices
- The leftover async search causes the async search results index to be created
- This test runs a search that then suffers
all shards failedbecause the async search results index exists but is not ready to search yet, i.e. the problem of #65846
If you agree then the fix could be to wait for pending tasks to complete at the end of the previous test. If you search for waitForPendingTasks in IntelliJ you'll find a few places where ML does this in order to stop spillover from one test to the next.
Please also upvote #65846.
because the async search results index exists but is not ready to search yet
Maybe this scenario could happen, there are failures occurring when fetching the async results; but there are also failures triggered even by fetching the async task's status (like here and here), or even for starting the async job (like here). Then all failures of this test are 404s vs 503s (as in #65846) and the reason, when available is No search context found for id.
And it also seems these types of failures for this test started past the 27th last month (there are a couple of other failures, but time out-related).
So it might still be the suggested root cause, but I'd like to poke around a bit more.
Pinging @elastic/es-analytical-engine (Team:Analytics)
I tested both the failures reported in https://github.com/elastic/elasticsearch/issues/76785 as well as reported here for both main (8.15) and 7.17 and all passed locally. However, in one of the above comments it was pointed out that these failures only occur in CI, not locally, and they appear to be unclosed previous async queries. So I'm making a PR that unmutes the test as well as adds the suggested waitForPendingTasks(adminClient(), taskName -> taskName.startsWith("indices:data/read/sql/async/get"));.
To see both 8.15 and 7.17 running in CI, I made two PRs:
- https://github.com/elastic/elasticsearch/pull/107955
- https://github.com/elastic/elasticsearch/pull/107956