elasticsearch
[CI] SimpleThreadPoolIT testThreadPoolMetrics failing
Build scan: https://gradle-enterprise.elastic.co/s/yebknro47qsu4/tests/:server:internalClusterTest/org.elasticsearch.threadpool.SimpleThreadPoolIT/testThreadPoolMetrics
Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.threadpool.SimpleThreadPoolIT.testThreadPoolMetrics" -Dtests.seed=8FA8BE6EA7422389 -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Europe/Zaporozhye -Druntime.java=21
Applicable branches: main
Reproduces locally?: Didn't try
Failure history:
Failure dashboard for org.elasticsearch.threadpool.SimpleThreadPoolIT#testThreadPoolMetrics
Failure excerpt:
java.lang.AssertionError:
Expected: map containing ["search.threads.queue.size"->iterable containing [a value equal to or greater than <1L>]]
but: map was [<search.threads.active.current=[0]>, <search.threads.completed.total=[476]>, <search.threads.count.current=[7]>, <search.threads.largest.current=[7]>, <search.threads.queue.size=[0]>]
at __randomizedtesting.SeedInfo.seed([8FA8BE6EA7422389:F9DF9EC5C9353D17]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2150)
at org.elasticsearch.threadpool.SimpleThreadPoolIT.lambda$testThreadPoolMetrics$4(SimpleThreadPoolIT.java:188)
at java.util.TreeMap.forEach(TreeMap.java:1317)
at java.util.Collections$UnmodifiableMap.forEach(Collections.java:1707)
at org.elasticsearch.threadpool.SimpleThreadPoolIT.lambda$testThreadPoolMetrics$5(SimpleThreadPoolIT.java:187)
at java.lang.Iterable.forEach(Iterable.java:75)
at org.elasticsearch.threadpool.SimpleThreadPoolIT.testThreadPoolMetrics(SimpleThreadPoolIT.java:161)
at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.lang.reflect.Method.invoke(Method.java:580)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
at java.lang.Thread.run(Thread.java:1583)
Pinging @elastic/es-core-infra (Team:Core/Infra)
This test is failing because search.threads.queue.size differs between the thread pool stats and the metric measurements:
[0006-05-03T15:16:51,033][INFO ][o.e.t.SimpleThreadPoolIT ] [testThreadPoolMetrics] Stats of `search`: {search.threads.active.current=0, search.threads.completed.total=475, search.threads.count.current=7, search.threads.largest.current=7, search.threads.queue.size=1}
[0006-05-03T15:16:51,033][INFO ][o.e.t.SimpleThreadPoolIT ] [testThreadPoolMetrics] Measurements of `search`: {search.threads.active.current=[0], search.threads.completed.total=[476], search.threads.count.current=[7], search.threads.largest.current=[7], search.threads.queue.size=[0]}
We wait for the thread pool stats to report that there is no active thread. A line later we collect the metric measurements. I suspect that a task was sitting in the queue at that moment and completed before the measurements were taken (completed.total goes from 475 to 476), hence the thread pool stats report a queue size of 1 while the metric measurement reports 0.
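The suspected timing issue can be reproduced deterministically outside Elasticsearch. The following is only a minimal sketch using a plain `java.util.concurrent.ThreadPoolExecutor`, not the Elasticsearch `ThreadPool` or the test framework; the class and method names are hypothetical. It is also simplified: the single worker is kept busy on purpose so that a task is guaranteed to be queued at the first read, whereas in the failing run the queued task appeared while no thread was active.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of the Elasticsearch code base.
public class QueueGaugeRace {

    /** Reads the queue size twice and returns {firstRead, secondRead}. */
    static int[] readQueueSizeTwice() {
        // One worker and an unbounded queue: a second task must wait in the queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch workerStarted = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy the only worker until we release it.
            Future<?> blocker = pool.submit(() -> {
                workerStarted.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workerStarted.await();

            // This task cannot run yet, so it sits in the queue.
            Future<?> queued = pool.submit(() -> {});

            // First read, analogous to the "stats" snapshot.
            int first = pool.getQueue().size();

            // Let both tasks finish before reading again.
            release.countDown();
            blocker.get();
            queued.get();

            // Second read, analogous to the "measurements" snapshot.
            int second = pool.getQueue().size();
            return new int[] { first, second };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] sizes = readQueueSizeTwice();
        System.out.println("first=" + sizes[0] + " second=" + sizes[1]); // prints "first=1 second=0"
    }
}
```

The queue size is a gauge that can drop between two reads, unlike a counter such as the completed task total, so asserting that the later read is greater than or equal to the earlier one is not safe while tasks are still in flight.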
@mosche wdyt? you worked on hardening this test before
I wonder if there is a way to reliably and gently shut down a thread pool or EsIntegTest so that we have a 'frozen' ES node that we can make assertions about.
I see a few options:
- Limit the test to thread pools for which we don't expect scheduled background threads, to remove the nondeterminism.
- Extend the check in line 155 to make sure active = 0 and queue = 0. That way anything gathered in the measurements later can only be greater than or equal to the stats. But then it is somewhat pointless to even check active and queued, and we could simply remove these.
- Alternatively, we could block and completely fill the thread pool the same way you've done in the Kibana thread pool test, though that seems unnecessarily complex for what we'd like to test.
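The second option (wait until both active and queue are zero before taking the stats snapshot) could look roughly like the sketch below. It is written against a plain `java.util.concurrent.ThreadPoolExecutor` as an assumption; `AwaitIdlePool` and `awaitIdle` are hypothetical names, and the actual test would use the Elasticsearch thread pool stats and its own wait helper instead of this polling loop.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the Elasticsearch code base.
public class AwaitIdlePool {

    /** True if no task is running and no task is queued at the time of the call. */
    static boolean isIdle(ThreadPoolExecutor pool) {
        // getActiveCount() is documented as an approximation, hence the polling below.
        return pool.getActiveCount() == 0 && pool.getQueue().isEmpty();
    }

    /** Polls until the pool is idle or the timeout elapses; returns whether it became idle. */
    static boolean awaitIdle(ThreadPoolExecutor pool, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (System.nanoTime() - deadline < 0) {
                if (isIdle(pool)) {
                    return true;
                }
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return isIdle(pool);
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        try {
            for (int i = 0; i < 20; i++) {
                pool.execute(() -> {});
            }
            // Only take the "stats" snapshot once active and queue are both zero.
            System.out.println("idle=" + awaitIdle(pool, 5_000));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

This only guarantees that the baseline for active and queued is zero; a background task can still be submitted right after the check, which is why the later measurements can only be compared with "greater than or equal".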
The test is failing due to a discrepancy between the thread pool stats and the APM metrics, most likely a timing issue. Assessing the risk as low.
Failed again https://gradle-enterprise.elastic.co/s/prkn746ltk5zw/tests/task/:server:internalClusterTest/details/org.elasticsearch.threadpool.SimpleThreadPoolIT/testThreadPoolMetrics?top-execution=1
This has been muted on branch main
Mute Reasons:
- [main] 2 failures in test testThreadPoolMetrics (1.0% fail rate in 201 executions)
Build Scans:
This has been muted on branch 8.x
Mute Reasons:
- [8.x] 18 failures in test testThreadPoolMetrics (2.5% fail rate in 719 executions)
- [8.x] 3 failures in step oraclelinux-8_platform-support-unix (15.0% fail rate in 20 executions)
- [8.x] 2 failures in step ubuntu-2004_platform-support-unix (11.1% fail rate in 18 executions)
- [8.x] 2 failures in step debian-11_platform-support-unix (11.1% fail rate in 18 executions)
- [8.x] 3 failures in step part1 (4.3% fail rate in 70 executions)
- [8.x] 2 failures in step part-1 (2.1% fail rate in 97 executions)
- [8.x] 10 failures in pipeline elasticsearch-periodic-platform-support (50.0% fail rate in 20 executions)
- [8.x] 2 failures in pipeline elasticsearch-periodic (10.5% fail rate in 19 executions)
- [8.x] 3 failures in pipeline elasticsearch-intake (4.3% fail rate in 70 executions)
- [8.x] 2 failures in pipeline elasticsearch-pull-request (2.1% fail rate in 94 executions)
Build Scans:
- elasticsearch-periodic-platform-support #4345 / oraclelinux-8_platform-support-unix
- elasticsearch-periodic-platform-support #4341 / oraclelinux-7_platform-support-unix
- elasticsearch-periodic-platform-support #4337 / rocky-9_platform-support-unix
- elasticsearch-intake #10937 / part1
- elasticsearch-periodic-platform-support #4333 / sles-12_platform-support-unix
- elasticsearch-periodic-platform-support #4325 / oraclelinux-8_platform-support-unix
- elasticsearch-periodic-platform-support #4317 / debian-11_platform-support-unix
- elasticsearch-periodic-platform-support #4309 / oraclelinux-8_platform-support-unix
- elasticsearch-intake #10778 / part1
- elasticsearch-periodic-platform-support #4305 / debian-11_platform-support-unix
This has been muted on branch 8.x
Mute Reasons:
- [8.x] 10 failures in test testThreadPoolMetrics (1.9% fail rate in 537 executions)
- [8.x] 2 failures in step part1 (2.1% fail rate in 96 executions)
- [8.x] 2 failures in step oraclelinux-8_platform-support-unix (18.2% fail rate in 11 executions)
- [8.x] 7 failures in pipeline elasticsearch-periodic-platform-support (53.8% fail rate in 13 executions)
- [8.x] 2 failures in pipeline elasticsearch-intake (2.1% fail rate in 96 executions)
Build Scans:
- elasticsearch-periodic-platform-support #4402 / rhel-7_platform-support-unix
- elasticsearch-periodic-platform-support #4385 / almalinux-8-aarch64_checkpart1_platform-support-arm
- elasticsearch-intake #11090 / part1
- elasticsearch-pull-request #35852 / part-1
- elasticsearch-periodic-platform-support #4345 / oraclelinux-8_platform-support-unix
- elasticsearch-periodic-platform-support #4341 / oraclelinux-7_platform-support-unix
- elasticsearch-periodic-platform-support #4337 / rocky-9_platform-support-unix
- elasticsearch-intake #10937 / part1
- elasticsearch-periodic-platform-support #4333 / sles-12_platform-support-unix
- elasticsearch-periodic-platform-support #4325 / oraclelinux-8_platform-support-unix