elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

[CI] IndicesSegmentsRestCancellationIT testCatSegmentsRestCancellation failing

Open fcofdez opened this issue 3 years ago • 6 comments
trafficstars

Build scan: https://gradle-enterprise.elastic.co/s/2o4ugc7mhz6si/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation

Reproduction line: ./gradlew ':qa:smoke-test-http:javaRestTest' --tests "org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation" -Dtests.seed=8D552D969E2896B -Dtests.locale=ar-IQ -Dtests.timezone=MIT -Druntime.java=8

Applicable branches: 7.17

Reproduces locally?: Didn't try

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.http.IndicesSegmentsRestCancellationIT&tests.test=testCatSegmentsRestCancellation

Failure excerpt:

junit.framework.AssertionFailedError: task 83/indices:monitor/segments[n] not cancelled

  at __randomizedtesting.SeedInfo.seed([8D552D969E2896B:5FA268A1A1677E70]:0)
  at junit.framework.Assert.fail(Assert.java:57)
  at junit.framework.Assert.assertTrue(Assert.java:22)
  at junit.framework.TestCase.assertTrue(TestCase.java:192)
  at org.elasticsearch.test.TaskAssertions.lambda$assertAllCancellableTasksAreCancelled$2(TaskAssertions.java:55)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1141)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1114)
  at org.elasticsearch.test.TaskAssertions.assertAllCancellableTasksAreCancelled(TaskAssertions.java:47)
  at org.elasticsearch.http.BlockedSearcherRestCancellationTestCase.runTest(BlockedSearcherRestCancellationTestCase.java:111)
  at org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation(IndicesSegmentsRestCancellationIT.java:21)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:750)

fcofdez avatar Jun 30 '22 08:06 fcofdez

Pinging @elastic/es-distributed (Team:Distributed)

elasticmachine avatar Jun 30 '22 08:06 elasticmachine

I've looked into this. Unfortunately no clue why it fails. That being said, this is nothing specific to 7.17. The test has failed only 3 times since 1 Jan (~2800 runs!), and that's across all branches. I haven't been able to reproduce it.

It seems both tests (indices segment and cat segment) in that suite will result in the same request type (IndicesSegmentsRequest), but only one of those tests has failed!

Since both tests are basically doing the same thing, my best guess is either there is some interference between the two (we could avoid sharing the cluster and use a ESIntegTestCase.Scope.TEST, or we simplify the assertions that the test makes. By the latter I mean, e.g. here if we have already asserted that we have received the cancellation exception, do we still need to reach into the internals of the node and verify the task status? (CC: @DaveCTurner and @fcofdez)

pxsalehi avatar Jul 27 '22 14:07 pxsalehi

FWIW I ran tens of thousands of iterations of the whole IndicesSegmentsRestCancellationIT suite on a CPU-stressed machine (usually good for triggering rare concurrency issues) without a single problem. I wonder if we're receiving a segments request from an external influence (e.g. another test suite running concurrently).

DaveCTurner avatar Jul 29 '22 10:07 DaveCTurner

Yes. I also tried 5k-6k runs of the suite on two branches, with no failure.

pxsalehi avatar Jul 29 '22 10:07 pxsalehi

I double checked this with @original-brownbear and he agrees that there is not much to do for this failure other than increasing the timeout, which was done a few weeks ago. We can just leave this open until next week and then close it if there are no new failures.

pxsalehi avatar Sep 13 '22 15:09 pxsalehi

This failed again on 7.17. https://gradle-enterprise.elastic.co/s/itisybsxku6iy/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation?top-execution=1

pxsalehi avatar Sep 20 '22 09:09 pxsalehi

Re-appeared in 8.9 at https://gradle-enterprise.elastic.co/s/szrl2yl34rtz2

kingherc avatar Jul 03 '23 16:07 kingherc

This issue is from one year ago! I think a new issue would be more fitting. The code state one year apart might be very different, and re-opening this instead of a new one might just be misleading when investigating the issue.

pxsalehi avatar Jul 04 '23 07:07 pxsalehi

I believe the error is so similar, that the history might be relevant. I see commits referenced, that might have been close to the root cause, but not exactly. So, I'd leave this to the judgement of whoever handles this ticket. They can close this and open a new one.

kingherc avatar Jul 04 '23 07:07 kingherc

New failure today on main: https://gradle-enterprise.elastic.co/s/zlqhwimhv7cju/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation?top-execution=1

junit.framework.AssertionFailedError: task 56/indices:monitor/segments[n] not cancelled	
at __randomizedtesting.SeedInfo.seed([52231FF4B54560F8:554258C7DC097E3]:0)	
at junit.framework.Assert.fail(Assert.java:57)	
at junit.framework.Assert.assertTrue(Assert.java:22)	
at junit.framework.TestCase.assertTrue(TestCase.java:192)	
at org.elasticsearch.test.TaskAssertions.lambda$assertAllCancellableTasksAreCancelled$2(TaskAssertions.java:74)	
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1173)	
at org.elasticsearch.test.TaskAssertions.assertAllCancellableTasksAreCancelled(TaskAssertions.java:65)	
at org.elasticsearch.http.BlockedSearcherRestCancellationTestCase.runTest(BlockedSearcherRestCancellationTestCase.java:111)	
at org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation(IndicesSegmentsRestCancellationIT.java:21)	

And 4 more in the past 7 days https://gradle-enterprise.elastic.co/scans/tests?search.startTimeMax=1689086885271&search.startTimeMin=1688418000000&search.timeZoneId=Europe/Bucharest&tests.container=org.elasticsearch.http.IndicesSegmentsRestCancellationIT&tests.test=testCatSegmentsRestCancellation on various branches.

albertzaharovits avatar Jul 11 '23 14:07 albertzaharovits

According to CI stats this test hasn't failed for 5+ months. Closing for now. Please re-open if new failures get encountered

volodk85 avatar Feb 09 '24 01:02 volodk85

Happened again, re-opening.

https://gradle-enterprise.elastic.co/s/tn5qk67ifyzmq/tests/task/:qa:smoke-test-http:javaRestTest/details/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation

alex-spies avatar Mar 26 '24 13:03 alex-spies

Here's another one: https://gradle-enterprise.elastic.co/s/gm7aj446luzyc/tests/task/:qa:smoke-test-http:javaRestTest/details/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation?top-execution=1

cbuescher avatar May 17 '24 13:05 cbuescher

Another one: https://gradle-enterprise.elastic.co/s/eaxoxqqxrjrle/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation

nielsbauman avatar May 24 '24 08:05 nielsbauman

Spent more time figuring out whats happening. Tried few hypothesis, none worked:

  • We dont inject semaphores to all nodes so tasks run successfully before we call cancellation. That implies flaky startup of internalCluster nodes, I wasnt able to start a cluster with any flakiness. Even single node injection is enough to pass test.
  • Task registration and cancellation happen in wrong order. I tried to inject random delays when we register task, ban/unban, unregister. Test always pass. We do 2 step assertions, making sure all tasks cancelled and all tasks finished, these 2 steps catch ordering problems.

I noticed in one gradle scan that we failed on final assertion to finish tasks, which means task somehow appear after we successfully cancelled and asserted that everything is cancelled. I think @DaveCTurner comment about external interference is most reasonable. I opened PR #110680 that should notify of possibility of external influence.

mhl-b avatar Jul 09 '24 23:07 mhl-b