elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

[CI] ShrinkActionIT testAutomaticRetryFailedShrinkAction failing

Open fcofdez opened this issue 3 years ago • 9 comments

Build scan: https://gradle-enterprise.elastic.co/s/yicchv3754amg/tests/:x-pack:plugin:ilm:qa:multi-node:javaRestTest/org.elasticsearch.xpack.ilm.actions.ShrinkActionIT/testAutomaticRetryFailedShrinkAction

Reproduction line: ./gradlew ':x-pack:plugin:ilm:qa:multi-node:javaRestTest' --tests "org.elasticsearch.xpack.ilm.actions.ShrinkActionIT.testAutomaticRetryFailedShrinkAction" -Dtests.seed=D966D2304D0930DC -Dtests.locale=pl-PL -Dtests.timezone=Europe/Tallinn -Druntime.java=11

Applicable branches: master

Reproduces locally?: Most of the time we can achieve 1 or 2 failures if we run it with -Dtests.iters=100.

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ilm.actions.ShrinkActionIT&tests.test=testAutomaticRetryFailedShrinkAction

Failure excerpt:

java.lang.AssertionError: (No message provided)

  at __randomizedtesting.SeedInfo.seed([D966D2304D0930DC:1F9215989061A05]:0)
  at org.junit.Assert.fail(Assert.java:86)
  at org.junit.Assert.assertTrue(Assert.java:41)
  at org.junit.Assert.assertTrue(Assert.java:52)
  at org.elasticsearch.xpack.ilm.actions.ShrinkActionIT.lambda$testAutomaticRetryFailedShrinkAction$17(ShrinkActionIT.java:295)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:985)
  at org.elasticsearch.xpack.ilm.actions.ShrinkActionIT.testAutomaticRetryFailedShrinkAction(ShrinkActionIT.java:295)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:829)

fcofdez avatar Sep 29 '21 14:09 fcofdez

More failures: https://gradle-enterprise.elastic.co/s/432pstypb7hz2/tests/:x-pack:plugin:ilm:qa:multi-node:javaRestTest/org.elasticsearch.xpack.ilm.actions.ShrinkActionIT/testAutomaticRetryFailedShrinkAction?top-execution=1

Muted in master (d2448bfd57b) and 7.x (ca951ee443b)

ywelsch avatar Oct 20 '21 08:10 ywelsch

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine avatar Jul 28 '22 13:07 elasticsearchmachine

From a quick look it looks like the test is failing because the shrink index name is null. What the test does is:

  • Set-up a policy with the wrong shard number
  • Then it fixes the policy
  • Then it checks if the shrink step is moving forward

The shrink index name being null could be an actual issue or just a timing issue, I will investigate further to figure out what's going on.

gmarouli avatar Jul 28 '22 13:07 gmarouli

Based on the logging ILM explained that is being logged:

[2022-07-28T15:45:56,690][INFO ][o.e.x.TimeSeriesRestDriver] [testAutomaticRetryFailedShrinkAction] --> index index-xsvudpmsaf, explain {index-xsvudpmsaf={index=index-xsvudpmsaf, managed=true, policy=policy-omzmg, index_creation_date_millis=1659019511781, time_since_index_creation=44.9s, lifecycle_date_millis=1659019511781, age=44.9s, phase=warm, phase_time_millis=1659019517402, action=shrink, action_time_millis=1659019519622, step=check-target-shards-count, step_time_millis=1659019524259, step_info={message=lifecycle action of policy [policy-omzmg] for index [index-xsvudpmsaf] cannot make progress because the target shards count [6] must be a factor of the source index's shards count [4]}, phase_execution={policy=policy-omzmg, phase_definition={min_age=0ms, actions={shrink={number_of_shards=1}}}, version=2, modified_date_in_millis=1659019525839}}}

I believe this is not a timing issue, because if I understand correctly the above, the policy has been updated:

phase_execution={policy=policy-omzmg, phase_definition={min_age=0ms, actions={shrink={number_of_shards=1}}}, version=2, modified_date_in_millis=1659019525839}}

And yet still the step info still says:

step_info={message=lifecycle action of policy [policy-omzmg] for index [index-xsvudpmsaf] cannot make progress because the target shards count [6] must be a factor of the source index's shards count [4]}

gmarouli avatar Jul 28 '22 14:07 gmarouli

  • First attempt with invalid target shard count:
[2022-07-29T22:57:19,024][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] State changed for index
[2022-07-29T22:57:19,024][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] retrieved current step key: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:57:19,042][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-0] parsed steps for policy [policy-GeICC] in phase [warm], definition: [{"policy":"policy-GeICC","phase_definition":{"min_age":"0ms","actions":{"shrink":{"number_of_shards":7}}},"version":1,"modified_date_in_millis":1659124628983}], steps: [[{"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"} => {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"} => {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"} => {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"} => {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"} => {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"} => {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"} => {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"} => null, {"phase":"warm","action":"migrate","name":"migrate"} => {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"migrate","name":"check-migration"} => {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"} => null, {"phase":"warm","action":"shrink","name":"check-not-write-index"} => {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"} => {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"readonly"} => {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"} with numberOfShards:7, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"} => {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"generate-index-name"} => {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"} => {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"} => {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrink"} => {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"} => {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"} => {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"} => null, {"phase":"warm","action":"shrink","name":"aliases"} => {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"} => {"phase":"warm","action":"complete","name":"complete"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"} => {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"shrink","name":"delete"} => {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"complete","name":"complete"} => null]]
[2022-07-29T22:57:19,043][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] maybe running step (CheckTargetShardsCountStep) after state change: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:57:19,043][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] running policy with current-step [{"phase":"warm","action":"shrink","name":"check-target-shards-count"} with numberOfShards:7]
[2022-07-29T22:57:19,044][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-oewyaxaocl","policy":"policy-GeICC","@timestamp":1659124639044,"index_age":13838,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-GeICC\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":7}}},\"version\":1,\"modified_date_in_millis\":1659124628983}","action_time":"1659124635556","phase_time":"1659124632508","action":"shrink","step":"check-target-shards-count","creation_date":"1659124625206","step_time":"1659124638697"}}]

Updated policy and verified that the target shard count is valid:

### Update policy with correct target shards count:
[2022-07-29T22:57:19,062][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-GeICC]
[2022-07-29T22:57:19,063][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-oewyaxaocl] updated policy [policy-GeICC] contains the same phase step keys and can be refreshed
[2022-07-29T22:57:19,063][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-oewyaxaocl] updating cached phase definition for policy [policy-GeICC]
[2022-07-29T22:57:19,063][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-GeICC] phase definition for [1] indices
[2022-07-29T22:57:19,404][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] retrieved current step key: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:57:19,404][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] maybe running periodic step (CheckTargetShardsCountStep) with current step {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:57:19,404][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] ignoring non periodic step execution from step transition [{"phase":"warm","action":"shrink","name":"check-target-shards-count"}]
[2022-07-29T22:57:19,405][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] cs-change-async-wait-callback, [.ds-ilm-history-5-2022.07.29-000001] current-step: {"phase":"hot","action":"rollover","name":"check-rollover-ready"}
[2022-07-29T22:57:19,408][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-0] updating cached steps for [policy-GeICC] policy, new steps: [{"phase":"new","action":"init","name":"init"}, {"phase":"new","action":"complete","name":"complete"}, {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"migrate"}, {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"check-not-write-index"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}, {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"aliases"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"}, {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"complete","name":"complete"}]
[2022-07-29T22:57:19,469][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-oewyaxaocl] retrieved current step key: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:57:19,508][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-0] parsed steps for policy [policy-GeICC] in phase [warm], definition: [{"policy":"policy-GeICC","phase_definition":{"min_age":"0ms","actions":{"shrink":{"number_of_shards":2}}},"version":2,"modified_date_in_millis":1659124639062}], steps: [[{"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"} => 

After that we do not see a cluster change happening to trigger the check-target-shards-count step. Whereas in the test that pass we see a state change right after the policy change:

[2022-07-29T22:56:21,343][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-dixiy]
[2022-07-29T22:56:21,344][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-wegktefhrd] updated policy [policy-dixiy] contains the same phase step keys and can be refreshed
[2022-07-29T22:56:21,344][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-wegktefhrd] updating cached phase definition for policy [policy-dixiy]
[2022-07-29T22:56:21,344][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-dixiy] phase definition for [1] indices
[2022-07-29T22:56:21,370][TRACE][o.e.x.i.IndexLifecycleService] [javaRestTest-0] job triggered: ilm, 1659124581366, 1659124581370
[2022-07-29T22:56:21,378][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] retrieved current step key: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:56:21,379][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] maybe running periodic step (CheckTargetShardsCountStep) with current step {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:56:21,379][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] ignoring non periodic step execution from step transition [{"phase":"warm","action":"shrink","name":"check-target-shards-count"}]
[2022-07-29T22:56:21,379][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [.ds-ilm-history-5-2022.07.29-000001] retrieved current step key: {"phase":"hot","action":"rollover","name":"check-rollover-ready"}
[2022-07-29T22:56:21,380][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [.ds-ilm-history-5-2022.07.29-000001] maybe running periodic step (WaitForRolloverReadyStep) with current step {"phase":"hot","action":"rollover","name":"check-rollover-ready"}
[2022-07-29T22:56:21,380][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [.ds-ilm-history-5-2022.07.29-000001] running periodic policy with current-step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]
[2022-07-29T22:56:21,382][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] cs-change-async-wait-callback, [.ds-ilm-history-5-2022.07.29-000001] current-step: {"phase":"hot","action":"rollover","name":"check-rollover-ready"}
[2022-07-29T22:56:21,637][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-0] updating cached steps for [policy-dixiy] policy, new steps: [{"phase":"new","action":"init","name":"init"}, {"phase":"new","action":"complete","name":"complete"}, {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"migrate"}, {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"check-not-write-index"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}, {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"aliases"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"}, {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"complete","name":"complete"}]
[2022-07-29T22:56:21,657][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] State changed for index
[2022-07-29T22:56:21,717][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] retrieved current step key: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:56:21,737][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-0] parsed steps for policy [policy-dixiy] in phase [warm], definition: [{"policy":"policy-dixiy","phase_definition":{"min_age":"0ms","actions":{"shrink":{"number_of_shards":1}}},"version":2,"modified_date_in_millis":1659124581343}], steps: [[{"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"} => {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"} => {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"} => {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"} => {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"} => {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"} => {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"} => {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"} => null, {"phase":"warm","action":"migrate","name":"migrate"} => {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"migrate","name":"check-migration"} => {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"} => null, {"phase":"warm","action":"shrink","name":"check-not-write-index"} => {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"} => {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"readonly"} => {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"} with numberOfShards:1, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"} => {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"generate-index-name"} => {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"} => {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"} => {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrink"} => {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"} => {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"} => {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"} => null, {"phase":"warm","action":"shrink","name":"aliases"} => {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"} => {"phase":"warm","action":"complete","name":"complete"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"} => {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"shrink","name":"delete"} => {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"complete","name":"complete"} => null]]
[2022-07-29T22:56:21,756][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] maybe running step (CheckTargetShardsCountStep) after state change: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:56:21,756][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] running policy with current-step [{"phase":"warm","action":"shrink","name":"check-target-shards-count"} with numberOfShards:1]
[2022-07-29T22:56:21,758][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] [index-wegktefhrd] retrieved current step key: {"phase":"warm","action":"shrink","name":"check-target-shards-count"}
[2022-07-29T22:56:21,793][TRACE][o.e.x.i.ExecuteStepsUpdateTask] [javaRestTest-0] [index-wegktefhrd] waiting for cluster state step condition (CheckTargetShardsCountStep) [{"phase":"warm","action":"shrink","name":"check-target-shards-count"}]
[2022-07-29T22:56:21,794][TRACE][o.e.x.i.ExecuteStepsUpdateTask] [javaRestTest-0] [index-wegktefhrd] cluster state step condition met successfully (CheckTargetShardsCountStep) [{"phase":"warm","action":"shrink","name":"check-target-shards-count"}], moving to next step {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}

I do not know why this is happening yet, but this explains why this is not a prominent ILM bug since in a live cluster there are more things to trigger a cluster change.

gmarouli avatar Jul 29 '22 19:07 gmarouli

So the culprit is here:

        if (busyIndices.contains(Tuple.tuple(indexMetadata.getIndex(), currentStepKey))) {
            // try later again, already doing work for this index at this step, no need to check for more work yet
            return;
        }

Sometimes when the update policy cluster change happens the index is still busy and it skips applying the change. I will try to trigger one more cluster change to see if this will stabilize the test.

gmarouli avatar Aug 01 '22 10:08 gmarouli

There is a hint for this one :). In org.elasticsearch.xpack.ilm.IndexLifecycleRunner:

/**
     * Tracks already executing {@link IndexLifecycleClusterStateUpdateTask} tasks in {@link #executingTasks} to prevent queueing up
     * duplicate cluster state updates.
     * TODO: refactor ILM logic so that this is not required any longer. It is unreasonably expensive to only filter out duplicate tasks at
     *       this point given how these tasks are mostly set up on the cluster state applier thread.
     *
     * @param source source string as used in {@link ClusterService#submitUnbatchedStateUpdateTask}
     * @param task   task to submit unless already tracked in {@link #executingTasks}.
     */
    private void submitUnlessAlreadyQueued(String source, IndexLifecycleClusterStateUpdateTask task) {
        if (executingTasks.add(task)) {
            final Tuple<Index, StepKey> dedupKey = Tuple.tuple(task.index, task.currentStepKey);
            // index+step-key combination on a best-effort basis to skip checking for more work for an index on CS application
            busyIndices.add(dedupKey);
            task.addListener(ActionListener.wrap(() -> {
                final boolean removed = executingTasks.remove(task);
                busyIndices.remove(dedupKey);
                assert removed : "tried to unregister unknown task [" + task + "]";
            }));
            clusterService.submitStateUpdateTask(source, task, ILM_TASK_CONFIG, ILM_TASK_EXECUTOR);
        } else {
            logger.trace("skipped redundant execution of [{}]", source);
        }
    }

gmarouli avatar Aug 04 '22 11:08 gmarouli

Relates to https://github.com/elastic/elasticsearch/pull/78390

gmarouli avatar Aug 04 '22 16:08 gmarouli

See https://github.com/elastic/elasticsearch/pull/89176 for the latest update.

gmarouli avatar Aug 08 '22 16:08 gmarouli