Overlords fail to elect a leader, preventing ingestion tasks from beginning
Affected Version
0.22.1
Description
We have 2 coordinators acting as overlords. With no apparent cause, neither overlord was being elected leader. We performed a rolling restart of every Druid service and our ZooKeeper nodes, but nothing changed. We observed this log message:
listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z] version[2022-03-19T01:36:35.743Z] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z}
We were only able to clear the problem and allow jobs to proceed by logging into the Postgres metadb and truncating the druid_tasklocks table.
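For reference, the cleanup was essentially the statement below, sketched here assuming the default "druid" table prefix (the name differs if druid.metadata.storage.tables.base has been customized):
-- Run against the Druid metadata store (Postgres in our case).
-- Table name assumes the default prefix.
TRUNCATE TABLE druid_tasklocks;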
Looking at the offending lock, we see it contains this information:
{"granularity":"timeChunk","type":"EXCLUSIVE","groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z","dataSource":"events-hour","interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z","version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}
The logs for the associated task begin like this:
Mar 19, 2022 @ 01:36:35.773 Task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z] started.
Mar 19, 2022 @ 01:36:37.933 Revoking task lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] for task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z]
Mar 19, 2022 @ 01:36:37.934 Replacing an existing lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] with a new lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=true}] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z
Mar 19, 2022 @ 01:36:37.957 Revoked taskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}]
Around this time, one of the coordinator/overlords was restarted. Leader election appeared to complete cleanly, but the overlord then immediately began spamming the log message shown above.
Ran into this exact same issue on 0.23.0
java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2020-11-09T00:00:00.000Z/2021-02-17T00:00:00.000Z] version[2022-09-12T05:08:57.538Z] for task: coordinator-issued_kill_cropwise-observations_boadhcea
Ran into the same issue on 0.21.1
We just experienced the same issue on version 0.22.2-SNAPSHOT. This is a huge problem as it blocks every subsequent ingestion :(
Experiencing the same issue on 24.0.0
It depends on how you configured the Overlord, but if you're using this param, i.e. you're using the metadata DB to store task-related information:
druid.indexer.storage.type=metadata
I managed to unblock things by deleting the offending task from the metadata database. Not ideal, but my production cluster was stuck.
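Roughly along these lines (a sketch assuming the default druid_tasks and druid_tasklocks table and column names; <stuck_task_id> is a placeholder for the actual task id):
-- Run against the Druid metadata database.
-- <stuck_task_id> is a placeholder; substitute the id of the stuck task.
DELETE FROM druid_tasks WHERE id = '<stuck_task_id>';
-- Depending on the state, the task's lock rows may also need removal,
-- as in the truncate workaround described above.
DELETE FROM druid_tasklocks WHERE task_id = '<stuck_task_id>';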
It will be fixed in the upcoming 24.0.1 release.
Good to know, thanks @abhishekagarwal87 :)