
Repair stuck - failed to take lead on node

Open larsiny opened this issue 4 years ago • 8 comments

Spotted on 2.2.1 (4be9bfc)

I've had a repair running for a little over a day, and it's repaired 1075 / 1076 segments. The last segment seems to be stuck, with the following error in the logs:

INFO   [2021-03-09 22:18:59,768] [x-cassandra-preprod:00000000-0000-0003-0000-000000000000] i.c.s.RepairRunner - Attempting to run new segment...
INFO   [2021-03-09 22:19:01,671] [x-cassandra-preprod:00000000-0000-0003-0000-000000000000] i.c.s.RepairRunner - Next segment to run : 00000000-0000-03f6-0000-000000000000
INFO   [2021-03-09 22:19:03,985] [x-cassandra-preprod:00000000-0000-0003-0000-000000000000] i.c.s.PostgresStorage - Failed to take lead on node x.x.x.x for segment 00000000-0000-03f6-0000-000000000000

Edit: Looking into it more, it seems all repairs are stuck with the same message in the logs.

Issue is synchronized with a Jira Task by Unito. Issue Number: K8SSAND-229

larsiny avatar Mar 09 '21 22:03 larsiny

Ouch... Could you share the content of the running_repairs table after obfuscating the IPs (but keeping them consistent so we know which node is which)? Are there segments in running state, or none? After that, can you truncate that table and see if that unblocks the repairs? The Postgres locks rely on timestamps since we don't have TTLs there; maybe something's going wrong on that front.
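
For reference, a minimal sketch of the checks discussed above, assuming the running_repairs columns shown in the output below and an illustrative 30-minute staleness threshold (the actual lead deadline depends on Reaper's configuration):

-- Leads whose heartbeat hasn't been renewed recently (the 30-minute cutoff is illustrative only)
SELECT repair_id, node, segment_id, last_heartbeat
FROM running_repairs
WHERE last_heartbeat < now() - interval '30 minutes'
ORDER BY last_heartbeat;

-- The workaround discussed in this thread: drop all leads so segments can be re-acquired
TRUNCATE running_repairs;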

adejanovski avatar Mar 10 '21 06:03 adejanovski

Hi @adejanovski,

No, there are no segments currently running (apart from a repair on a different cluster that's currently not stuck).

Below is the output of running_repairs. Repair ID 7 is the only one that's not stuck at the moment.

REAPER=> SELECT * FROM running_repairs ORDER BY last_heartbeat;
 repair_id |     node     | reaper_instance_host |  reaper_instance_id  | segment_id |        last_heartbeat
-----------+--------------+----------------------+----------------------+------------+-------------------------------
         4 | 10.0.0.1     | 10.10.1.1              | -7852674672088628830 |       2528 | 2021-03-09 05:35:51.357303+00
         4 | 10.0.0.2     | 10.10.1.1              | -7852674672088628830 |       2528 | 2021-03-09 05:35:51.357303+00
         4 | 10.0.0.3     | 10.10.1.1              | -7852674672088628830 |       2528 | 2021-03-09 05:35:51.357303+00
         4 | 10.0.0.4     | 10.10.1.1              | -7852674672088628830 |       2528 | 2021-03-09 05:35:51.357303+00
         4 | 10.0.0.5     | 10.10.1.1              | -7852674672088628830 |       2528 | 2021-03-09 05:35:51.357303+00
         4 | 10.0.0.6     | 10.10.1.1              | -7852674672088628830 |       2528 | 2021-03-09 05:35:51.357303+00
         3 | 10.0.0.7     | 10.10.1.1              | -7852674672088628830 |       1014 | 2021-03-09 07:10:16.044101+00
         3 | 10.0.0.8     | 10.10.1.1              | -7852674672088628830 |       1014 | 2021-03-09 07:10:16.044101+00
         3 | 10.0.0.9     | 10.10.1.1              | -7852674672088628830 |       1014 | 2021-03-09 07:10:16.044101+00
         3 | 10.0.0.10    | 10.10.1.1              | -7852674672088628830 |       1014 | 2021-03-09 07:10:16.044101+00
         3 | 10.0.0.11    | 10.10.1.1              | -7852674672088628830 |       1014 | 2021-03-09 07:10:16.044101+00
         3 | 10.0.0.12    | 10.10.1.1              | -7852674672088628830 |       1014 | 2021-03-09 07:10:16.044101+00
        13 | 10.1.0.1     | 10.10.1.1              | -7852674672088628830 |      34802 | 2021-03-09 07:58:49.158439+00
        13 | 10.1.0.2     | 10.10.1.1              | -7852674672088628830 |      34802 | 2021-03-09 07:58:49.158439+00
        13 | 10.1.0.3     | 10.10.1.1              | -7852674672088628830 |      34802 | 2021-03-09 07:58:49.158439+00
        13 | 10.1.0.4     | 10.10.1.1              | -7852674672088628830 |      30543 | 2021-03-09 08:20:27.999968+00
        13 | 10.1.0.5     | 10.10.1.1              | -7852674672088628830 |      30543 | 2021-03-09 08:20:27.999968+00
        13 | 10.1.0.6     | 10.10.1.1              | -7852674672088628830 |      30543 | 2021-03-09 08:20:27.999968+00
        13 | 10.1.0.7     | 10.10.1.1              | -7852674672088628830 |      33930 | 2021-03-09 08:20:45.829408+00
        13 | 10.1.0.8     | 10.10.1.1              | -7852674672088628830 |      33930 | 2021-03-09 08:20:45.829408+00
        13 | 10.1.0.9     | 10.10.1.1              | -7852674672088628830 |      33930 | 2021-03-09 08:20:45.829408+00
        13 | 10.1.0.10    | 10.10.1.1              | -7852674672088628830 |      37890 | 2021-03-09 09:28:10.67804+00
        13 | 10.1.0.11    | 10.10.1.1              | -7852674672088628830 |      37890 | 2021-03-09 09:28:10.67804+00
        13 | 10.1.0.12    | 10.10.1.1              | -7852674672088628830 |      37890 | 2021-03-09 09:28:10.67804+00
        13 | 10.1.0.13    | 10.10.1.1              | -7852674672088628830 |      32071 | 2021-03-09 09:28:29.467767+00
        13 | 10.1.0.14    | 10.10.1.1              | -7852674672088628830 |      32071 | 2021-03-09 09:28:29.467767+00
        13 | 10.1.0.15    | 10.10.1.1              | -7852674672088628830 |      32071 | 2021-03-09 09:28:29.467767+00
        13 | 10.1.0.16    | 10.10.1.1              | -7852674672088628830 |      33390 | 2021-03-09 10:56:01.861937+00
        13 | 10.1.0.17    | 10.10.1.1              | -7852674672088628830 |      33390 | 2021-03-09 10:56:01.861937+00
        13 | 10.1.0.18    | 10.10.1.1              | -7852674672088628830 |      33390 | 2021-03-09 10:56:01.861937+00
        13 | 10.1.0.19    | 10.10.1.1              | -7852674672088628830 |      35382 | 2021-03-09 16:56:51.154691+00
        13 | 10.1.0.20    | 10.10.1.1              | -7852674672088628830 |      35382 | 2021-03-09 16:56:51.154691+00
        13 | 10.1.0.21    | 10.10.1.1              | -7852674672088628830 |      35382 | 2021-03-09 16:56:51.154691+00
         7 | 10.2.0.1     | 10.10.1.1              | -6792121109870000602 |       6555 | 2021-03-10 07:18:13.117629+00
         7 | 10.2.0.2     | 10.10.1.1              | -6792121109870000602 |       6555 | 2021-03-10 07:18:13.117629+00
         7 | 10.2.0.3     | 10.10.1.1              | -6792121109870000602 |       6555 | 2021-03-10 07:18:13.117629+00
         7 | 10.2.0.4     | 10.10.1.1              | -6792121109870000602 |       4598 | 2021-03-10 07:18:15.774315+00
         7 | 10.2.0.5     | 10.10.1.1              | -6792121109870000602 |       4598 | 2021-03-10 07:18:15.774315+00
         7 | 10.2.0.6     | 10.10.1.1              | -6792121109870000602 |       4598 | 2021-03-10 07:18:15.774315+00
         7 | 10.2.0.7     | 10.10.1.1              | -6792121109870000602 |       4917 | 2021-03-10 07:18:47.051861+00
         7 | 10.2.0.8     | 10.10.1.1              | -6792121109870000602 |       4917 | 2021-03-10 07:18:47.051861+00
         7 | 10.2.0.9     | 10.10.1.1              | -6792121109870000602 |       4917 | 2021-03-10 07:18:47.051861+00
         7 | 10.2.0.10    | 10.10.1.1              | -6792121109870000602 |       5767 | 2021-03-10 07:18:58.3956+00
         7 | 10.2.0.11    | 10.10.1.1              | -6792121109870000602 |       5767 | 2021-03-10 07:18:58.3956+00
         7 | 10.2.0.12    | 10.10.1.1              | -6792121109870000602 |       5767 | 2021-03-10 07:18:58.3956+00
(45 rows)

Truncating the running_repairs table unblocked the repairs and some segments seem to be completing, but I'm still seeing some errors in the logs:

INFO   [2021-03-10 07:35:04,697] [x:00000000-0000-0007-0000-000000000000:00000000-0000-1910-0000-000000000000] i.c.s.PostgresStorage - Failed to take lead on node x.x.x.x for segment 00000000-0000-1910-0000-000000000000
INFO   [2021-03-10 07:35:01,010] [x:00000000-0000-000d-0000-000000000000:00000000-0000-8445-0000-000000000000] i.c.s.PostgresStorage - Failed renewing lead on segment 00000000-0000-8445-0000-000000000000 for node x.x.x.x
ERROR  [2021-03-10 07:36:29,868] [x:00000000-0000-0007-0000-000000000000:00000000-0000-190e-0000-000000000000] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: non-leader trying to update repair segment 00000000-0000-190e-0000-000000000000 of run 00000000-0000-0007-0000-000000000000
	at io.cassandrareaper.storage.PostgresStorage.updateRepairSegment(PostgresStorage.java:419)
	at io.cassandrareaper.service.SegmentRunner.postpone(SegmentRunner.java:198)
	at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:213)
	at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:226)
	at io.cassandrareaper.service.SegmentRunner.processTriggeredSegment(SegmentRunner.java:415)
	at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:315)
	at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:161)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:117)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:38)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

larsiny avatar Mar 10 '21 07:03 larsiny

The error messages could be related to the truncation, since we removed the existing locks. Still, blocked repairs like the ones you've seen should never happen and need to be thoroughly investigated.
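
One way to sanity-check that leads are being taken and renewed again after the truncation, assuming the same running_repairs schema as in the output above:

-- Latest heartbeat per repair and Reaper instance; the timestamps should keep advancing
SELECT repair_id, reaper_instance_id, count(*) AS held_leads, max(last_heartbeat) AS latest_heartbeat
FROM running_repairs
GROUP BY repair_id, reaper_instance_id
ORDER BY latest_heartbeat DESC;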

adejanovski avatar Mar 10 '21 08:03 adejanovski

The java.lang.AssertionError shows up in the logs every time the last segment in the repair is retried:

INFO   [2021-03-10 09:04:57,090] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.SegmentRunner - Repair command 20565 on segment 00000000-0000-03f6-0000-000000000000 returned with state NOT_STARTED
INFO   [2021-03-10 09:04:57,090] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.SegmentRunner - Repair command 20565 on segment 00000000-0000-03f6-0000-000000000000 never managed to start within timeout.
INFO   [2021-03-10 09:04:57,090] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.SegmentRunner - Postponing segment 00000000-0000-03f6-0000-000000000000
INFO   [2021-03-10 09:04:57,091] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.PostgresStorage - Failed renewing lead on segment 00000000-0000-03f6-0000-000000000000 for node x.x.x.x
ERROR  [2021-03-10 09:04:57,091] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.PostgresStorage - Failed to renew lead on segment 00000000-0000-0003-0000-000000000000
ERROR  [2021-03-10 09:04:57,092] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: non-leader trying to update repair segment 00000000-0000-03f6-0000-000000000000 of run 00000000-0000-0003-0000-000000000000
	at io.cassandrareaper.storage.PostgresStorage.updateRepairSegment(PostgresStorage.java:419)
	at io.cassandrareaper.service.SegmentRunner.postpone(SegmentRunner.java:198)
	at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:213)
	at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:226)
	at io.cassandrareaper.service.SegmentRunner.processTriggeredSegment(SegmentRunner.java:415)
	at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:315)
	at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:161)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:117)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:38)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
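
The failed lead renewal right before the postpone appears to be what trips the non-leader assertion. If a leftover lock row for the stuck segment is suspected, it can be inspected (and removed) directly rather than truncating the whole table again; this assumes the integer segment_id column holds the low bits of the segment UUID (e.g. 00000000-0000-03f6-... -> 0x3f6 = 1014), which matches the table output earlier in this thread:

-- Look for a lingering lock row for the stuck segment (the UUID-to-integer mapping is an assumption)
SELECT * FROM running_repairs WHERE segment_id = 1014;

-- If a stale row turns up, deleting just that row is a narrower alternative to TRUNCATE
DELETE FROM running_repairs WHERE segment_id = 1014;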

larsiny avatar Mar 10 '21 09:03 larsiny

Hi, I'm facing the same issue. What information could help resolve this ticket?

Thanks in advance

TaiSHiNet avatar Apr 28 '21 09:04 TaiSHiNet

Hi folks, is this still an issue you're facing?

adejanovski avatar Mar 25 '22 10:03 adejanovski

Yes, it is. I haven't been able to find a solution for it.

larsiny avatar Mar 25 '22 10:03 larsiny

@LarsinYousif, the Postgres backend has been removed in Reaper 3.x, and it is highly unlikely that we'll come up with a fix for it. Is there any chance you could move to the Cassandra backend?

adejanovski avatar Mar 25 '22 11:03 adejanovski