elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

[CI] DocsClientYamlTestSuiteIT test {yaml=reference/sql/endpoints/rest/line_574} failing

Open jfreden opened this issue 11 months ago • 14 comments

Build scan: https://gradle-enterprise.elastic.co/s/u3ln5vofvnbtw/tests/:docs:yamlRestTest/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fsql%2Fendpoints%2Frest%2Fline_574%7D

Reproduction line:

./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/sql/endpoints/rest/line_574}" -Dtests.seed=461A50DBB2EAB7FF -Dtests.locale=ar-SD -Dtests.timezone=SystemV/EST5 -Druntime.java=21

Applicable branches: main

Reproduces locally?: Yes

Failure history: Failure dashboard for org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT#test {yaml=reference/sql/endpoints/rest/line_574}

Failure excerpt:

org.elasticsearch.client.WarningFailureException: method [DELETE], host [http://127.0.0.1:39763], URI [*,-.ds-ilm-history-*,-.ds-.slm-history-*?expand_wildcards=open%2Cclosed%2Chidden], status line [HTTP/1.1 200 OK]
Warnings: [this request accesses system indices: [.async-search], but in a future major version, direct access to system indices will be prevented by default]
{"acknowledged":true}

  at __randomizedtesting.SeedInfo.seed([461A50DBB2EAB7FF:CE4E6F011C16DA07]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.test.rest.ESRestTestCase.wipeAllIndices(ESRestTestCase.java:1113)
  at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:853)
  at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:539)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

jfreden avatar Mar 11 '24 08:03 jfreden

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine avatar Mar 11 '24 08:03 elasticsearchmachine

I've assigned this one to myself, it's caused by https://github.com/elastic/elasticsearch/pull/106140.

joegallo avatar Mar 11 '24 16:03 joegallo

https://gradle-enterprise.elastic.co/s/yocwlc3kri3jg/tests/task/:docs:yamlRestTest/details/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fsql%2Fendpoints%2Frest%2Fline_574%7D

joegallo avatar Mar 11 '24 16:03 joegallo

I had unmuted these tests on https://github.com/elastic/elasticsearch/pull/106140, they were originally muted via https://github.com/elastic/elasticsearch/pull/75237. I have re-muted them via https://github.com/elastic/elasticsearch/pull/106196.

Since https://github.com/elastic/elasticsearch/pull/94191 has been merged, the async search feature is being reset when we clean up the cluster between rest tests. However, I think the .async-search index is being re-created as a race condition because of the async nature of writing to that index (or something 🤷).

So the flow in the cleanup code ends up being:

  1. The test runs, it will create an async-search
  2. The async-search feature is reset, the .async-search index will be deleted if it exists
  3. The async-search is actually created and the .async-search index is created
  4. We get to wipeAllIndices(...) and it complains that .async-search is a system index

/cc @williamrandolph since we talked about exactly this sort of possibility

joegallo avatar Mar 11 '24 17:03 joegallo

The failures reproduce really easily, btw.

joegallo@simulacron:~/Code/elastic/elasticsearch $ git reset --hard upstream/main
HEAD is now at 0dd6ce2df6c Delete DownsampleClusterDisruptionIT (#106225)
joegallo@simulacron:~/Code/elastic/elasticsearch $ git log --oneline -1
0dd6ce2df6c (HEAD -> main, upstream/main, upstream/HEAD) Delete DownsampleClusterDisruptionIT (#106225)
joegallo@simulacron:~/Code/elastic/elasticsearch $ git revert --no-edit 8be3fb229d8c29d7a3931ffcb5e03e0bb0476c0f
[main 27c07af19e4] Revert "Mute failing snippet tests (#106196)"
 Date: Tue Mar 12 09:52:24 2024 -0400
 1 file changed, 6 deletions(-)

And then it's just a matter of:

./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/sql/endpoints/rest/*}"

That doesn't fail 100% of the time, but it's pretty reliable (it failed 2 out of the 4 times I tried this on my laptop just now).

joegallo avatar Mar 12 '24 13:03 joegallo

I've unassigned myself from this issue because this seems to be a legitimate issue and none of it's actually my area.

joegallo avatar Mar 12 '24 14:03 joegallo

I suppose something like the following diff would square this away, but that seems contrary to the point of https://github.com/elastic/elasticsearch/pull/94191:

diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index 307daddd17c..24f6424e71e 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -1109,7 +1109,7 @@ public abstract class ESRestTestCase extends ESTestCase {
         try {
             // remove all indices except some history indices which can pop up after deleting all data streams but shouldn't interfere
             final List<String> indexPatterns = new ArrayList<>(
-                List.of("*", "-.ds-ilm-history-*", "-.ds-.slm-history-*", "-.ds-.watcher-history-*")
+                List.of("*", "-.async-search", "-.ds-ilm-history-*", "-.ds-.slm-history-*", "-.ds-.watcher-history-*")
             );
             if (preserveSecurityIndices) {
                 indexPatterns.add("-.security-*");

joegallo avatar Mar 12 '24 14:03 joegallo

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Mar 12 '24 16:03 elasticsearchmachine

Given the similarity to #104013, I'm labeling this one as low-risk instead of blocker (that is, labeled the same as that issue already is).

joegallo avatar Mar 12 '24 16:03 joegallo

I think this one is similar to this:

REPRODUCE WITH: ./gradlew ':docs:yamlRestTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/esql/esql-rest/line_297}" -Dtests.seed=53ABCE485C630DC4 -Dtests.locale=hu -Dtests.timezone=Asia/Irkutsk -Druntime.java=21

nik9000 avatar Mar 13 '24 14:03 nik9000

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine avatar Mar 13 '24 14:03 elasticsearchmachine

Since this has to do with system indices and those being cleaned up, I am gonna tag core infra as well.

I don't know what the focus would be for search, since .async_search is a system index, its management, etc. should be pretty transparent to us.

@williamrandolph would we have to add some logic on waiting for tasks or something in the async search cleanup?

benwtrent avatar Mar 13 '24 14:03 benwtrent

Also failing on 8.13 https://gradle-enterprise.elastic.co/s/35culdxu6ttss/tests/:docs:yamlRestTest/org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT/test%20%7Byaml=reference%2Fesql%2Fesql-rest%2Fline_297%7D

davidkyle avatar Mar 18 '24 15:03 davidkyle

It would be ideal if I could implement some cleanup logic by overriding SystemIndexPlugin#cleanUpFeature in AsyncResultsIndexPlugin. I'll take a look at doing that.

williamrandolph avatar Mar 18 '24 19:03 williamrandolph

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine avatar Jul 17 '24 18:07 elasticsearchmachine

I dug into this a bit. What I found confirms @joegallo hypothesis:

[2024-08-08T15:27:14,717][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature started
[2024-08-08T15:27:14,718][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature done
[2024-08-08T15:27:14,786][INFO ][o.e.a.a.i.c.AutoCreateAction] [node-0] AutoCreateAction#execute .async-search
[2024-08-08T15:27:14,786][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] executing IndexCreationTask for [CreateIndexClusterStateUpdateRequest{cause='auto(bulk api)', index='.async-search', dataStreamName='null', providedName='.async-search', recoverFrom=null, resizeType=null, copySettings=false, settings={"index.auto_expand_replicas":"0-1","index.codec":"best_compression","index.hidden":"true","index.number_of_shards":"1"}, aliases=[], blocks=[], waitForActiveShards=ALL, systemDataStreamDescriptor=null, matchingTemplate=null}] against cluster state version [54]
[2024-08-08T15:27:14,786][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] applying create index request for system index [.async-search] matching pattern [.async-search*]
[2024-08-08T15:27:14,790][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]

What is happening is that the snippet starts a async query, but does not wait for it (the test actually replaces the wait_for_completion_timeout with 0). The test suite proceeds to cleanup, calling SystemIndexPlugin#cleanUpFeature; this would cleanup .async-search (and it does, in some rare cases, when the query completes faster than we reach cleanUpFeature).

For example, these are the logs from a successful execution:

[2024-08-08T16:48:29,600][INFO ][o.e.a.a.i.c.AutoCreateAction] [node-0] AutoCreateAction#execute .async-search
[2024-08-08T16:48:29,600][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] executing IndexCreationTask for [CreateIndexClusterStateUpdateRequest{cause='auto(bulk api)', index='.async-search', dataStreamName='null', providedName='.async-search', recoverFrom=null, resizeType=null, copySettings=false, settings={"index.auto_expand_replicas":"0-1","index.codec":"best_compression","index.hidden":"true","index.number_of_shards":"1"}, aliases=[], blocks=[], waitForActiveShards=ALL, systemDataStreamDescriptor=null, matchingTemplate=null}] against cluster state version [52]
[2024-08-08T16:48:29,600][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] applying create index request for system index [.async-search] matching pattern [.async-search*]
[2024-08-08T16:48:29,604][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2024-08-08T16:48:29,605][INFO ][o.e.c.r.a.AllocationService] [node-0] updating number_of_replicas to [0] for indices [.async-search]
[2024-08-08T16:48:31,581][INFO ][o.e.x.m.MachineLearning  ] [node-0] Starting machine learning feature reset
[2024-08-08T16:48:31,584][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature started
[2024-08-08T16:48:31,586][INFO ][o.e.x.a.AsyncResultsIndexPlugin] [node-0] cleanUpFeature done
[2024-08-08T16:48:31,612][INFO ][o.e.c.m.MetadataDeleteIndexService] [node-0] [.async-search/922JLz6jRga6h4Xa2euk4w] deleting index
[2024-08-08T16:48:31,732][INFO ][o.e.x.m.MachineLearning  ] [node-0] Finished machine learning feature reset

Notice how .async-search is creater before AsyncResultsIndexPlugin#cleanUpFeature is called here.

This means that this has little to do with how system indices are being cleaned up; it's more a test issue -- these tests launch a query, then forget about it, creating a race between the system index creation (when the query eventually completes) and cleanup. Overriding AsyncResultsIndexPlugin#cleanUpFeature does no good either -- the test fails when creation happens after cleanUpFeature and before wipeAllIndices (which generates the deprecation warning, as it will find a system index to delete).

ldematte avatar Aug 08 '24 14:08 ldematte

I see 4 ways of fixing this:

  1. what joe suggested: https://github.com/elastic/elasticsearch/issues/106158#issuecomment-1991738483 do not delete .async-search. Not my preferred option TBH
  2. in all async search tests, know that a deprecation warning can be generated and add TEST[warning:] (see docs). Not sure about this either, the deprecation warning could eventually become an error.
  3. Wait for tasks completion? Here https://github.com/elastic/elasticsearch/blob/9496fa37469d543d73e558bdcf0d710d63d5833b/docs/src/yamlRestTest/java/org/elasticsearch/smoketest/DocsClientYamlTestSuiteIT.java#L144 we already wait for ML tests.
  4. Skip execution on all these tests (transform mute to // TEST[skip:])

ldematte avatar Aug 08 '24 14:08 ldematte

In any case, I think this is more an issue with how the test is written, not a system index problem, so I'm going to untag Core/Infra. But let me know if you think this is not right and still need our input.

ldematte avatar Aug 08 '24 15:08 ldematte