elasticsearch
elasticsearch copied to clipboard
[CI] TransformIndexerStateTests testStopAtCheckpoint failing
I wasn't able to reproduce it locally (tried with -Dtests.iters=500
).
Build scan: https://gradle-enterprise.elastic.co/s/3tgciezc7km42/tests/:x-pack:plugin:transform:test/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint
Reproduction line:
./gradlew ':x-pack:plugin:transform:test' --tests "org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests.testStopAtCheckpoint" -Dtests.seed=F796A9736602D273 -Dtests.locale=nb-NO -Dtests.timezone=Antarctica/Palmer -Druntime.java=18
Applicable branches: 8.5
Reproduces locally?: No
Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests&tests.test=testStopAtCheckpoint
Failure excerpt:
java.lang.AssertionError:
Expected: <5>
but: was <4>
at __randomizedtesting.SeedInfo.seed([F796A9736602D273:F10C883B5F71A962]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests.testStopAtCheckpoint(TransformIndexerStateTests.java:442)
at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.lang.reflect.Method.invoke(Method.java:577)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
at java.lang.Thread.run(Thread.java:833)
Pinging @elastic/ml-core (Team:ML)
Another failure https://gradle-enterprise.elastic.co/s/ouamdof7kcboi/tests/:x-pack:plugin:transform:test/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint
Another one with the same Expected: <5> but: was <4>
error
https://gradle-enterprise.elastic.co/s/jhclh54htqkwc/tests/task/:x-pack:plugin:transform:test/details/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint?top-execution=1
Assigning low-risk
as the test isn't muted and fails incredibly rarely.
Another failure: https://gradle-enterprise.elastic.co/s/kgutmpotorwrc/tests/task/:x-pack:plugin:transform:test/details/org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests/testStopAtCheckpoint?top-execution=1
Recently, this has failed every other day and also on main
: https://gradle-enterprise.elastic.co/scans/tests?search.buildOutcome=failure&search.timeZoneId=Europe/Berlin&tests.container=org.elasticsearch.xpack.transform.transforms.TransformIndexerStateTests&tests.test=testStopAtCheckpoint.
Still see random failures, not on Windows https://gradle-enterprise.elastic.co/s/5dn7otmudeko6
Failed again on the intake run: https://gradle-enterprise.elastic.co/s/nf47bzd4hoclk
Another one at elasticsearch / periodic / platform-support / main / opensuse-leap-15 / platform-support-unix
I'm not sure if this is the issue, but this looks like it can be an issue: doNextSearch
- Thread 1 (test thread) runs maybeTriggerAsyncJob, this starts Thread 2.
- If Thread 2 runs
doNextSearch
before Thread 1 runs createAwaitForSearchLatch, then Thread 2 will start Thread 3. - If Thread 3 starts reading from doSaveState while Thread 1 starts updating setStopAtCheckpoint, then we create a situation where we successfully run the listener but we do not get a chance to increment the test counter in the test's doSaveState
This would at least explain why we pass validation when we check that all 5 listeners ran here, but then we fail validation checking that the count incremented here
Some options:
- move the
createAwaitForSearchLatch
up such that Thread 1 blocks Thread 2 from starting Thread 3 until after Thread 1 finishes updating thesaveStateListeners
- ~~remove the validation checking the counter, since we already check if the listener ran during a
doSaveState
call which is what the first validation is doing~~- other tests are using this for different reasons, so it's tough (for me) to say if we should remove this or simplify it or anything, for now we can prioritize fixing the inconsistencies
~~I'll timebox looking through other possibilities and options~~
It's also possible saveStateListenerCallCount will be read/write in a non-thread-safe way, so we can also swap that out for an AtomicInteger.