elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

[CI] RestSqlIT testAsyncTextPaginated failing

Open droberts195 opened this issue 4 years ago • 9 comments

Although the same test, this failure differs from #76785, because that was a timeout whereas this one is "all shards failed". I wonder if this is yet another manifestation of #65846?

Build scan: https://gradle-enterprise.elastic.co/s/xehunhrnkfhhi/tests/:x-pack:plugin:sql:qa:server:security:with-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated

Reproduction line: ./gradlew ':x-pack:plugin:sql:qa:server:security:with-ssl:integTest' --tests "org.elasticsearch.xpack.sql.qa.security.RestSqlIT.testAsyncTextPaginated" -Dtests.seed=B1D5C515E1E4486D -Dtests.locale=hr-HR -Dtests.timezone=Asia/Irkutsk -Druntime.java=17

Applicable branches: master, 8.0, 7.16

Reproduces locally?: No

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.sql.qa.security.RestSqlIT&tests.test=testAsyncTextPaginated

Failure excerpt:

org.elasticsearch.client.ResponseException: method [GET], host [https://[::1]:45613], URI [/_sql/async/FkZhZEVPa2VyUmphUlRzQW5mMWYtX2cbb0U0bzB2UGlSWnVGZ2RRV2ZlU3dnQTo0MDMz?format=txt&wait_for_completion_timeout=1d], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}},"status":404}

  at __randomizedtesting.SeedInfo.seed([B1D5C515E1E4486D:FB65C4A44320DED2]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:335)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:276)
  at org.elasticsearch.xpack.sql.qa.rest.RestSqlTestCase.testAsyncTextPaginated(RestSqlTestCase.java:1344)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

droberts195 avatar Oct 29 '21 13:10 droberts195

Pinging @elastic/es-ql (Team:QL)

elasticmachine avatar Oct 29 '21 13:10 elasticmachine

Same thing also happened on 8.0 - see https://gradle-enterprise.elastic.co/s/4sq2l2sm6y5yq/tests/:x-pack:plugin:sql:qa:server:security:without-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated

droberts195 avatar Oct 29 '21 14:10 droberts195

There are 4 test classes with the name RestSqlIT

  • org.elasticsearch.xpack.sql.qa.security.RestSqlIT
  • org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.RestSqlIT
  • org.elasticsearch.xpack.sql.qa.single_node.RestSqlIT
  • org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT

All have had failures in the last 7 days apart from org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT

https://build-stats.elastic.co/app/kibana#/discover?_g=h@33d9122&_a=h@19cd33a

The test is muted in the base class in #80126

davidkyle avatar Nov 01 '21 11:11 davidkyle

The failures won't reproduce locally, but all CI failures contain the warning There are still tasks running after this test that might break subsequent tests (also not reproducible locally) after every test preceding this failing one, which I suppose has to do with the test failing.

bpintea avatar Nov 01 '21 12:11 bpintea

The leftover task is indices:data/read/sql/async/get.

What might be happening is:

  1. The previous test leaves an async search in progress when it completes
  2. The cleanup that runs between tests deletes all indices
  3. The leftover async search causes the async search results index to be created
  4. This test runs a search that then suffers all shards failed because the async search results index exists but is not ready to search yet, i.e. the problem of #65846

If you agree then the fix could be to wait for pending tasks to complete at the end of the previous test. If you search for waitForPendingTasks in IntelliJ you'll find a few places where ML does this in order to stop spillover from one test to the next.

Please also upvote #65846.

droberts195 avatar Nov 01 '21 12:11 droberts195

because the async search results index exists but is not ready to search yet

Maybe this scenario could happen, there are failures occurring when fetching the async results; but there are also failures triggered even by fetching the async task's status (like here and here), or even for starting the async job (like here). Then all failures of this test are 404s vs 503s (as in #65846) and the reason, when available is No search context found for id.

And it also seems these types of failures for this test started past the 27th last month (there are a couple of other failures, but time out-related).

So it might still be the suggested root cause, but I'd like to poke around a bit more.

bpintea avatar Nov 02 '21 21:11 bpintea

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine avatar Jan 17 '24 21:01 elasticsearchmachine

I tested both the failures reported in https://github.com/elastic/elasticsearch/issues/76785 as well as reported here for both main (8.15) and 7.17 and all passed locally. However, in one of the above comments it was pointed out that these failures only occur in CI, not locally, and they appear to be unclosed previous async queries. So I'm making a PR that unmutes the test as well as adds the suggested waitForPendingTasks(adminClient(), taskName -> taskName.startsWith("indices:data/read/sql/async/get"));.

craigtaverner avatar Apr 26 '24 13:04 craigtaverner

To see both 8.15 and 7.17 running in CI, I made two PRs:

  • https://github.com/elastic/elasticsearch/pull/107955
  • https://github.com/elastic/elasticsearch/pull/107956

craigtaverner avatar Apr 26 '24 15:04 craigtaverner