Overlord cannot elect a leader because of exception when reacquiring locks.
Affected Version
0.21.1
Description
2021-09-01T00:27:08,408 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] version[2021-08-31T23:32:28.782Z] for task: index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z}
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] version[2021-08-31T23:32:28.782Z] for task: index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z
at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:159) ~[druid-indexing-service-0.21.1.jar:0.21.1]
at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:97) [druid-server-0.21.1.jar:0.21.1]
at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702) [curator-recipes-4.3.0.jar:4.3.0]
at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698) [curator-recipes-4.3.0.jar:4.3.0]
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [curator-framework-4.3.0.jar:4.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
The error above prevented every Overlord instance from becoming leader. Is this error really severe enough that the Overlord cannot complete leader election?
I hit the same issue in v0.22.0.
Workaround: delete all rows from the druid_tasks table.
The log comes from this line: https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java#L159
It throws a new RuntimeException(e), but the reported stack trace shows only the top-level RuntimeException, without the stack trace of the ISE that triggered it. That makes it difficult to debug, because nothing indicates what the bad state is. If anyone else hits this, please post the full stack trace so that we can determine what is going on.
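For anyone collecting that information, here is a minimal sketch of how a wrapped exception keeps its cause and how the whole chain can be printed. It is not Druid code: `CauseChainDemo`, `wrapLikeBecomeLeader` and `causeChain` are hypothetical names, and a plain `IllegalStateException` stands in for Druid's ISE.

```java
import java.util.ArrayList;
import java.util.List;

public class CauseChainDemo {
    // Walks getCause() from the outermost exception down to the root cause
    // and returns one "class: message" entry per exception in the chain.
    static List<String> causeChain(Throwable t) {
        List<String> chain = new ArrayList<>();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            chain.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return chain;
    }

    // Hypothetical stand-in for the wrapping pattern: an inner exception is
    // caught and rethrown as new RuntimeException(e).
    static RuntimeException wrapLikeBecomeLeader() {
        try {
            throw new IllegalStateException(
                "Could not reacquire lock on interval[example] for task: example_task");
        } catch (Exception e) {
            // RuntimeException(Throwable) keeps e as the cause and uses
            // e.toString() as its own message.
            return new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        RuntimeException wrapped = wrapLikeBecomeLeader();
        // One line per exception in the chain, outermost first.
        for (String line : causeChain(wrapped)) {
            System.out.println(line);
        }
        // printStackTrace() also prints the "Caused by:" section, which is
        // the part missing from the log excerpt in this issue.
        wrapped.printStackTrace();
    }
}
```

The `Caused by:` section of the Overlord log, if the logger emitted one, is the part that would show where the ISE was actually thrown.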
I see this issue in 0.19.0 as well. I'm trying to understand why leader election would fail just because the Overlord is unable to reacquire locks for a single task.
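To make the question concrete, here is a rough sketch of the two behaviours being contrasted. It is an illustration only, not Druid's implementation and not a description of any fix: `LockSyncSketch`, `syncFailFast`, `syncTolerant` and the `reacquire` predicate are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class LockSyncSketch {
    // Fail-fast: the first task whose lock cannot be reacquired aborts the
    // whole sync, so the caller (here, becoming leader) fails as well.
    static void syncFailFast(List<String> taskIds, Predicate<String> reacquire) {
        for (String id : taskIds) {
            if (!reacquire.test(id)) {
                throw new IllegalStateException(
                    "Could not reacquire lock for task: " + id);
            }
        }
    }

    // Tolerant: keep going and return the tasks whose locks could not be
    // reacquired, so the caller can decide what to do with them.
    static List<String> syncTolerant(List<String> taskIds, Predicate<String> reacquire) {
        List<String> failed = new ArrayList<>();
        for (String id : taskIds) {
            if (!reacquire.test(id)) {
                failed.add(id);
            }
        }
        return failed;
    }
}
```

With the fail-fast variant, one bad task blocks everything. With the tolerant variant, only the affected tasks need handling.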
This is also happening in Imply version 2022.05 (I think it's Druid 0.22 or 0.23). The workaround that got our Overlords back up, pretty much thanks to @vmorarian, was deleting the offending tasks from the MySQL table (not all of them, just the ones the Overlord was complaining about).
@tanisdlj - can you share the overlord logs with us?
@abhishekagarwal87 Sorry, I was out. We opened a support ticket (10666) with Imply and I think they are already looking into it, so it should be fixed soon-ish, I guess.
We are also experiencing this and have been manually editing the metadata store to fix it (not ideal). Does anyone have updates on this?
@jasonwzm Is it reproducible under some conditions? I am looking into how to reproduce this locally. It would be super helpful if someone could share the logs of the Overlord, the Coordinator, and the tasks that share the lock when the issue happens.
Closed with #13172