elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

[CI] TransformIndexerStateTests testStopAtCheckpoint failing

Open gmarouli opened this issue 2 years ago • 9 comments

I wasn't able to reproduce it locally (tried with -Dtests.iters=500).

Build scan: https://gradle-enterprise.elastic.co/s/3tgciezc7km42/tests/:x-pack:plugin:transform:test/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint

Reproduction line: ./gradlew ':x-pack:plugin:transform:test' --tests "org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests.testStopAtCheckpoint" -Dtests.seed=F796A9736602D273 -Dtests.locale=nb-NO -Dtests.timezone=Antarctica/Palmer -Druntime.java=18

Applicable branches: 8.5

Reproduces locally?: No

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests&tests.test=testStopAtCheckpoint

Failure excerpt:

java.lang.AssertionError: 
Expected: <5>
     but: was <4>

  at __randomizedtesting.SeedInfo.seed([F796A9736602D273:F10C883B5F71A962]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests.testStopAtCheckpoint(TransformIndexerStateTests.java:442)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:577)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

gmarouli avatar Sep 30 '22 08:09 gmarouli

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine avatar Sep 30 '22 08:09 elasticsearchmachine

Another failure https://gradle-enterprise.elastic.co/s/ouamdof7kcboi/tests/:x-pack:plugin:transform:test/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint

davidkyle avatar Feb 13 '23 13:02 davidkyle

Another one with the same Expected: <5> but: was <4> error

https://gradle-enterprise.elastic.co/s/jhclh54htqkwc/tests/task/:x-pack:plugin:transform:test/details/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint?top-execution=1

davidkyle avatar Sep 11 '23 09:09 davidkyle

Assigning low-risk as the test isn't muted and fails incredibly rarely.

droberts195 avatar Oct 10 '23 09:10 droberts195

Another failure: https://gradle-enterprise.elastic.co/s/kgutmpotorwrc/tests/task/:x-pack:plugin:transform:test/details/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint?top-execution=1

gmarouli avatar Oct 18 '23 08:10 gmarouli

Recently, this has failed every other day and also on main: https://gradle-enterprise.elastic.co/scans/tests?search.buildOutcome=failure&search.timeZoneId=Europe/Berlin&tests.container=org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests&tests.test=testStopAtCheckpoint.

pxsalehi avatar Oct 24 '23 14:10 pxsalehi

Still see random failures, not on Windows https://gradle-enterprise.elastic.co/s/5dn7otmudeko6

rjernst avatar Feb 05 '24 19:02 rjernst

Failed again on the intake run: https://gradle-enterprise.elastic.co/s/nf47bzd4hoclk

slobodanadamovic avatar Feb 07 '24 09:02 slobodanadamovic

I'm not sure if this is the issue, but this looks like it can be an issue: doNextSearch

  • Thread 1 (test thread) runs maybeTriggerAsyncJob, this starts Thread 2.
  • If Thread 2 runs doNextSearch before Thread 1 runs createAwaitForSearchLatch, then Thread 2 will start Thread 3.
  • If Thread 3 starts reading from doSaveState while Thread 1 starts updating setStopAtCheckpoint, then we create a situation where we successfully run the listener but we do not get a chance to increment the test counter in the test's doSaveState

This would at least explain why we pass validation when we check that all 5 listeners ran here, but then we fail validation checking that the count incremented here

Some options:

  • move the createAwaitForSearchLatch up such that Thread 1 blocks Thread 2 from starting Thread 3 until after Thread 1 finishes updating the saveStateListeners
  • ~~remove the validation checking the counter, since we already check if the listener ran during a doSaveState call which is what the first validation is doing~~
    • other tests are using this for different reasons, so it's tough (for me) to say if we should remove this or simplify it or anything, for now we can prioritize fixing the inconsistencies

~~I'll timebox looking through other possibilities and options~~

It's also possible saveStateListenerCallCount will be read/write in a non-thread-safe way, so we can also swap that out for an AtomicInteger.

prwhelan avatar Feb 20 '24 14:02 prwhelan