Overlord cannot elect a leader because of exception when reacquiring locks.

Open didip opened this issue 4 years ago • 6 comments

Affected Version

0.21.1

Description

2021-09-01T00:27:08,408 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] version[2021-08-31T23:32:28.782Z] for task: index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z}
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2021-08-29T00:00:00.000Z/2021-08-30T00:00:00.000Z] version[2021-08-31T23:32:28.782Z] for task: index_parallel_mytable_aplllohg_2021-08-31T13:15:38.699Z
    at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:159) ~[druid-indexing-service-0.21.1.jar:0.21.1]
    at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:97) [druid-server-0.21.1.jar:0.21.1]
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702) [curator-recipes-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698) [curator-recipes-4.3.0.jar:4.3.0]
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) [curator-framework-4.3.0.jar:4.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]

The error above completely prevented every Overlord instance from becoming leader. Is this error so important that the Overlord cannot complete leader election?

didip avatar Sep 01 '21 18:09 didip

Faced the same issue in v0.22.0.

vmorarian avatar Oct 18 '21 13:10 vmorarian

Workaround: delete all rows from the druid_tasks table.

vmorarian avatar Oct 18 '21 15:10 vmorarian

The line that this log came from is https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskMaster.java#L159

That line throws a new RuntimeException(e). However, the reported stack trace shows only the top-level RuntimeException, without the stack trace of the ISE that triggered it. That makes it difficult to debug, because nothing indicates what the bad state actually is. If this happens to anybody else, please provide the full stack trace so that we can determine what is going on.
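To illustrate why the full trace matters, here is a minimal sketch of the wrapping pattern, not the actual TaskMaster code. The class name, the `reacquireLocks` helper, and the message text are hypothetical, and `IllegalStateException` stands in for Druid's ISE. The wrapped exception keeps the original as its cause, so a complete stack trace should include a "Caused by:" section pointing at where the bad state was detected.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class CauseChainDemo {
    // Hypothetical stand-in for the lock reacquisition step; not Druid code.
    public static void reacquireLocks() {
        throw new IllegalStateException("Could not reacquire lock (illustrative message)");
    }

    // Wraps the failure the same way the comment above describes:
    // the original exception is passed as the cause of a RuntimeException.
    public static RuntimeException wrapFailure() {
        try {
            reacquireLocks();
            return null; // not reached in this sketch
        } catch (IllegalStateException e) {
            return new RuntimeException(e);
        }
    }

    // Renders the complete stack trace, including any "Caused by:" sections.
    public static String fullTrace(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    public static void main(String[] args) {
        RuntimeException wrapped = wrapFailure();
        // The top-level message only repeats the cause's toString();
        // the frames of the cause appear further down in the full trace.
        System.out.println(fullTrace(wrapped));
    }
}
```

If the trace in a log stops before the "Caused by:" section, the frames that show where the lock could not be reacquired are the part that is missing.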

imply-cheddar avatar Nov 18 '21 02:11 imply-cheddar

Found this to be the issue in 0.19.0 as well. Just trying to understand why leader election would fail just because it's unable to acquire locks for a single task.

bharadwajrembar avatar Feb 08 '22 04:02 bharadwajrembar

This is also happening in Imply version 2022.05 (I think it's Druid 0.22 or 0.23). The workaround that got our Overlords alive again, pretty much thanks to @vmorarian, was deleting the offending tasks from the MySQL table (not all of them, just the ones the Overlord was complaining about).

tanisdlj avatar Aug 08 '22 11:08 tanisdlj

@tanisdlj - can you share the overlord logs with us?

abhishekagarwal87 avatar Aug 11 '22 08:08 abhishekagarwal87

@abhishekagarwal87 sorry! I was out. There is a support ticket (10666) we opened in Imply and I think they are looking into it already, should be fixed soon-ish I guess

tanisdlj avatar Aug 26 '22 08:08 tanisdlj

We are also experiencing this and have been manually intervening in the metadata store to fix it (not ideal). Does anyone have updates on this?

jasonwzm avatar Sep 08 '22 19:09 jasonwzm

@jasonwzm Is it reproducible under some conditions? I am looking into how to repro this locally. It would be super helpful if you could share the logs of the Overlord, the Coordinator, and the tasks that share the lock when the issue happens.

cryptoe avatar Sep 09 '22 05:09 cryptoe

Closed with #13172

kfaraz avatar Oct 17 '22 09:10 kfaraz