druid icon indicating copy to clipboard operation
druid copied to clipboard

Overlords fail to elect any leaders preventing ingestion tasks from beginning

Open cloventt opened this issue 2 years ago • 3 comments

Affected Version

0.22.1

Description

We have 2 coordinators acting as overlord. With no apparent cause, neither overlords were being elected as leader. We performed a rolling restart of every Druid service and our Zookeeper nodes, but nothing changed. We observed this log message:

listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z] version[2022-03-19T01:36:35.743Z] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z}

We were only able to clear the problem and allow jobs to proceed by logging into the Postgres metadb and truncating the druid_tasklocks table.

Looking at the offending lock, we see it contains this information:

{"granularity":"timeChunk","type":"EXCLUSIVE","groupId":"index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z","dataSource":"events-hour","interval":"2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z","version":"2022-03-19T01:36:35.743Z","priority":50,"revoked":true}

The logs for the associated task begin like this:

Mar 19, 2022 @ 01:36:35.773 Task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z] started.
Mar 19, 2022 @ 01:36:37.933 Revoking task lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] for task[index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z]
Mar 19, 2022 @ 01:36:37.934 Replacing an existing lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}] with a new lock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=true}] for task: index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z
Mar 19, 2022 @ 01:36:37.957 Revoked taskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_parallel_events-hour_lkialjgb_2022-03-19T01:35:26.522Z', dataSource='events-hour', interval=2022-03-19T00:00:00.000Z/2022-03-19T01:00:00.000Z, version='2022-03-19T01:36:35.743Z', priority=50, revoked=false}]

Around this time, one of the coordinator/overlords was restarted. The leader election seems to have occurred cleanly, but then it immediately began spamming the log from above.

cloventt avatar Mar 19 '22 04:03 cloventt

Ran into this exact same issue on 0.23.0

forzamehlano avatar Aug 15 '22 07:08 forzamehlano

java.lang.RuntimeException: org.apache.druid.java.util.common.ISE: Could not reacquire lock on interval[2020-11-09T00:00:00.000Z/2021-02-17T00:00:00.000Z] version[2022-09-12T05:08:57.538Z] for task: coordinator-issued_kill_cropwise-observations_boadhcea

Ran into same issue on 0.21.1

ManojCKrishna avatar Sep 13 '22 05:09 ManojCKrishna

We just experienced the same issue on version 0.22.2-SNAPSHOT. This is a huge problem as it blocks every subsequent ingestion :(

cp2587 avatar Oct 05 '22 08:10 cp2587

Experiencing the same issue on 24.0.0

ThomasBarach avatar Oct 21 '22 12:10 ThomasBarach

Depending on how you configured Overlord, but if you're using this param, ie you're using metadata DB to store task related information :

druid.indexer.storage.type=metadata

I've managed to unlock the situation by deleting the task from the metadata database. Not ideal, but my production cluster was stuck.

ThomasBarach avatar Oct 21 '22 12:10 ThomasBarach

It will be fixed in the upcoming 24.0.1 release.

abhishekagarwal87 avatar Oct 21 '22 13:10 abhishekagarwal87

Good to know, thanks @abhishekagarwal87 :)

ThomasBarach avatar Oct 21 '22 13:10 ThomasBarach