Druid Overlord cannot elect a leader because of an exception when reacquiring locks

Affected Version

0.21.1

Description

2021-09-01T00:27:08,408 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] version[2021-08-31T23:32:28.782Z] for task: index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z}
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] version[2021-08-31T23:32:28.782Z] for task: index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z
    at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:159) ~[druid-indexing-service-0.21.1.jar:0.21.1]
    at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:97) [druid-server-0.21.1.jar:0.21.1]
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702) [curator-recipes-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698) [curator-recipes-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [curator-framework-4.3.0.jar:4.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]

The error above completely prevented any Overlord instance from becoming the leader. Is this error so severe that the Overlord cannot perform leader election at all?

Sep 01 '21 18:09 didip

Facing the same issue in v0.22.0.

Oct 18 '21 13:10 vmorarian

Workaround: delete all rows from the druid_tasks table.

Oct 18 '21 15:10 vmorarian

The line that this log came from is https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java#L159

That line throws a new RuntimeException(e). However, the reported stack trace shows only the top-level RuntimeException, without the stack trace of the ISE that triggered it. That makes it difficult to debug, because nothing indicates what the bad state is. If this happens to anyone else, please provide the full stack trace so that we can determine what is going on.
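One way to collect the full stack trace from an Overlord log is to pull the ERROR entry together with every continuation line that follows it, then check whether a nested cause is present. The snippet below is a minimal sketch in Python, assuming the usual Java log layout (frames indented with `at`, nested causes introduced by `Caused by:`). The names `extract_trace` and `has_root_cause` are hypothetical helpers, not part of Druid, and the sample text is abbreviated from the report above with a placeholder final line.

```python
import re

# Lines that continue a Java stack trace: frames ("    at ..."),
# nested causes ("Caused by: ..."), and elided frames ("    ... 12 more").
CONTINUATION = re.compile(r"^(\s+at |Caused by: |\s+\.\.\. \d+ more)")
# The line right after the ERROR entry names the thrown exception,
# e.g. "java.lang.RuntimeException: ...".
EXCEPTION_HEADER = re.compile(r"^[\w.$]+(Exception|Error)\b")


def extract_trace(log_text, marker="listener becomeLeader() failed"):
    """Return the first log entry containing `marker` plus its stack trace lines."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if marker in line:
            block = [line]
            for follow in lines[i + 1:]:
                if CONTINUATION.match(follow) or EXCEPTION_HEADER.match(follow):
                    block.append(follow)
                else:
                    break  # first line that is not part of the trace
            return block
    return []


def has_root_cause(block):
    """True if the trace includes at least one 'Caused by:' section."""
    return any(line.startswith("Caused by: ") for line in block)


# Abbreviated sample based on the report; the last line is a placeholder.
sample = """\
2021-09-01T00:27:08,408 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] ... - listener becomeLeader() failed. Unable to become leader
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Could not reacquire lock
    at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:159)
    at java.lang.Thread.run(Thread.java:748)
(next log entry)
"""

trace = extract_trace(sample)
print("\n".join(trace))
print("includes root cause:", has_root_cause(trace))
```

If `has_root_cause` reports no nested cause, as it does for the trace posted in this issue, the log is missing exactly the part needed to see what the bad state is.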

Nov 18 '21 02:11 imply-cheddar

Found this issue in 0.19.0 as well. I'm trying to understand why leader election would fail just because the Overlord is unable to reacquire locks for a single task.

Feb 08 '22 04:02 bharadwajrembar

This is also happening in Imply version 2022.05 (I think it's Druid 0.22 or 0.23). The workaround that brought our Overlords back up, pretty much thanks to @vmorarian, was deleting the offending tasks from the MySQL table (not all of them, just the ones the Overlord was complaining about).
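The targeted variant of this workaround can be sketched as two steps: pull the task IDs out of the `Could not reacquire lock` messages, then delete only those rows. This is a rough illustration in Python, not a vetted procedure. It assumes the message format shown in the original report and a task table named `druid_tasks` keyed by an `id` column. An in-memory SQLite database with a one-column stand-in schema takes the place of the real MySQL metadata store, and `offending_task_ids` and `delete_tasks` are hypothetical helper names.

```python
import re
import sqlite3

# Matches the message from the report:
# "Could not reacquire lock on interval[...] version[...] for task: <task id>"
LOCK_ERROR = re.compile(
    r"Could not reacquire lock on interval\[(?P<interval>[^\]]+)\] "
    r"version\[(?P<version>[^\]]+)\] for task: (?P<task>[^\s}]+)"
)


def offending_task_ids(log_text):
    """Return the distinct task IDs named in lock errors, in order of appearance."""
    seen = []
    for match in LOCK_ERROR.finditer(log_text):
        task = match.group("task")
        if task not in seen:
            seen.append(task)
    return seen


def delete_tasks(conn, task_ids):
    """Delete only the given task rows and return how many rows were removed."""
    removed = 0
    for task_id in task_ids:
        cur = conn.execute("DELETE FROM druid_tasks WHERE id = ?", (task_id,))
        removed += cur.rowcount
    conn.commit()
    return removed


# Log fragment taken from the report above.
log = (
    "exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on "
    "interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] "
    "version[2021-08-31T23:32:28.782Z] for task: "
    "index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z}"
)

# Stand-in for the metadata store (assumption: only the id column is modeled).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE druid_tasks (id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO druid_tasks (id) VALUES (?)",
    [("index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z",), ("some_other_task",)],
)

ids = offending_task_ids(log)
removed = delete_tasks(conn, ids)
remaining = [row[0] for row in conn.execute("SELECT id FROM druid_tasks")]
print(ids, removed, remaining)
```

Using a parameterized `DELETE` per extracted ID keeps the change limited to the tasks the Overlord actually named, which is the difference from deleting every row.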

Aug 08 '22 11:08 tanisdlj

@tanisdlj - can you share the overlord logs with us?

Aug 11 '22 08:08 abhishekagarwal87

@abhishekagarwal87 Sorry, I was out. We opened a support ticket (10666) with Imply and I think they are already looking into it, so it should be fixed soon-ish, I guess.

Aug 26 '22 08:08 tanisdlj

We are also experiencing this and have been manually intervening in the metadata store to fix it (not ideal). Does anyone have updates on this?

Sep 08 '22 19:09 jasonwzm

@jasonwzm Is it reproducible under some conditions? I am looking into how to reproduce this locally. It would be super helpful if someone could share the logs of the Overlord, the Coordinator, and the tasks that share the lock when the issue happens.

Sep 09 '22 05:09 cryptoe

Closed with #13172

Oct 17 '22 09:10 kfaraz