cassandra-reaper
Repair stuck - failed to take lead on node
Spotted on 2.2.1 (4be9bfc)
I've had a repair running for a little over a day, and it's repaired 1075 / 1076 segments. The last segment seems to be stuck, with the following error in the logs:
INFO [2021-03-09 22:18:59,768] [x-cassandra-preprod:00000000-0000-0003-0000-000000000000] i.c.s.RepairRunner - Attempting to run new segment...
INFO [2021-03-09 22:19:01,671] [x-cassandra-preprod:00000000-0000-0003-0000-000000000000] i.c.s.RepairRunner - Next segment to run : 00000000-0000-03f6-0000-000000000000
INFO [2021-03-09 22:19:03,985] [x-cassandra-preprod:00000000-0000-0003-0000-000000000000] i.c.s.PostgresStorage - Failed to take lead on node x.x.x.x for segment 00000000-0000-03f6-0000-000000000000
Edit: Looking into it more, it seems all repairs are stuck with the same message in the logs.
Issue is synchronized with this Jira Task by Unito. Issue Number: K8SSAND-229
ouch... Could you share the content of the running_repairs table after obfuscating the IPs (but keeping them consistent so we know which node is which)?
Are there segments in running state or none?
After that, can you truncate that table and see if that unblocks the repairs?
The postgres locks rely on timestamps since we don't have TTLs there, maybe there's something wrong there.
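As a rough sketch of what that check could look like (the 30-minute cutoff is just an arbitrary example, not Reaper's actual lease timeout), you could first list the rows with stale heartbeats and then clear the table:

-- Sketch: list lead rows whose heartbeat looks stale
SELECT repair_id, node, segment_id, last_heartbeat
  FROM running_repairs
 WHERE last_heartbeat < now() - interval '30 minutes'
 ORDER BY last_heartbeat;

-- If the repairs are confirmed stuck, clear all leads as suggested above
TRUNCATE running_repairs;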
Hi @adejanovski,
No, there are no segments currently running (apart from a repair on a different cluster that's currently not stuck).
Below is the output of running_repairs. Repair ID 7 is the only one that's not stuck at the moment.
REAPER=> SELECT * FROM running_repairs ORDER BY last_heartbeat;
repair_id | node | reaper_instance_host | reaper_instance_id | segment_id | last_heartbeat
-----------+--------------+----------------------+----------------------+------------+-------------------------------
4 | 10.0.0.1 | 10.10.1.1 | -7852674672088628830 | 2528 | 2021-03-09 05:35:51.357303+00
4 | 10.0.0.2 | 10.10.1.1 | -7852674672088628830 | 2528 | 2021-03-09 05:35:51.357303+00
4 | 10.0.0.3 | 10.10.1.1 | -7852674672088628830 | 2528 | 2021-03-09 05:35:51.357303+00
4 | 10.0.0.4 | 10.10.1.1 | -7852674672088628830 | 2528 | 2021-03-09 05:35:51.357303+00
4 | 10.0.0.5 | 10.10.1.1 | -7852674672088628830 | 2528 | 2021-03-09 05:35:51.357303+00
4 | 10.0.0.6 | 10.10.1.1 | -7852674672088628830 | 2528 | 2021-03-09 05:35:51.357303+00
3 | 10.0.0.7 | 10.10.1.1 | -7852674672088628830 | 1014 | 2021-03-09 07:10:16.044101+00
3 | 10.0.0.8 | 10.10.1.1 | -7852674672088628830 | 1014 | 2021-03-09 07:10:16.044101+00
3 | 10.0.0.9 | 10.10.1.1 | -7852674672088628830 | 1014 | 2021-03-09 07:10:16.044101+00
3 | 10.0.0.10 | 10.10.1.1 | -7852674672088628830 | 1014 | 2021-03-09 07:10:16.044101+00
3 | 10.0.0.11 | 10.10.1.1 | -7852674672088628830 | 1014 | 2021-03-09 07:10:16.044101+00
3 | 10.0.0.12 | 10.10.1.1 | -7852674672088628830 | 1014 | 2021-03-09 07:10:16.044101+00
13 | 10.1.0.1 | 10.10.1.1 | -7852674672088628830 | 34802 | 2021-03-09 07:58:49.158439+00
13 | 10.1.0.2 | 10.10.1.1 | -7852674672088628830 | 34802 | 2021-03-09 07:58:49.158439+00
13 | 10.1.0.3 | 10.10.1.1 | -7852674672088628830 | 34802 | 2021-03-09 07:58:49.158439+00
13 | 10.1.0.4 | 10.10.1.1 | -7852674672088628830 | 30543 | 2021-03-09 08:20:27.999968+00
13 | 10.1.0.5 | 10.10.1.1 | -7852674672088628830 | 30543 | 2021-03-09 08:20:27.999968+00
13 | 10.1.0.6 | 10.10.1.1 | -7852674672088628830 | 30543 | 2021-03-09 08:20:27.999968+00
13 | 10.1.0.7 | 10.10.1.1 | -7852674672088628830 | 33930 | 2021-03-09 08:20:45.829408+00
13 | 10.1.0.8 | 10.10.1.1 | -7852674672088628830 | 33930 | 2021-03-09 08:20:45.829408+00
13 | 10.1.0.9 | 10.10.1.1 | -7852674672088628830 | 33930 | 2021-03-09 08:20:45.829408+00
13 | 10.1.0.10 | 10.10.1.1 | -7852674672088628830 | 37890 | 2021-03-09 09:28:10.67804+00
13 | 10.1.0.11 | 10.10.1.1 | -7852674672088628830 | 37890 | 2021-03-09 09:28:10.67804+00
13 | 10.1.0.12 | 10.10.1.1 | -7852674672088628830 | 37890 | 2021-03-09 09:28:10.67804+00
13 | 10.1.0.13 | 10.10.1.1 | -7852674672088628830 | 32071 | 2021-03-09 09:28:29.467767+00
13 | 10.1.0.14 | 10.10.1.1 | -7852674672088628830 | 32071 | 2021-03-09 09:28:29.467767+00
13 | 10.1.0.15 | 10.10.1.1 | -7852674672088628830 | 32071 | 2021-03-09 09:28:29.467767+00
13 | 10.1.0.16 | 10.10.1.1 | -7852674672088628830 | 33390 | 2021-03-09 10:56:01.861937+00
13 | 10.1.0.17 | 10.10.1.1 | -7852674672088628830 | 33390 | 2021-03-09 10:56:01.861937+00
13 | 10.1.0.18 | 10.10.1.1 | -7852674672088628830 | 33390 | 2021-03-09 10:56:01.861937+00
13 | 10.1.0.19 | 10.10.1.1 | -7852674672088628830 | 35382 | 2021-03-09 16:56:51.154691+00
13 | 10.1.0.20 | 10.10.1.1 | -7852674672088628830 | 35382 | 2021-03-09 16:56:51.154691+00
13 | 10.1.0.21 | 10.10.1.1 | -7852674672088628830 | 35382 | 2021-03-09 16:56:51.154691+00
7 | 10.2.0.1 | 10.10.1.1 | -6792121109870000602 | 6555 | 2021-03-10 07:18:13.117629+00
7 | 10.2.0.2 | 10.10.1.1 | -6792121109870000602 | 6555 | 2021-03-10 07:18:13.117629+00
7 | 10.2.0.3 | 10.10.1.1 | -6792121109870000602 | 6555 | 2021-03-10 07:18:13.117629+00
7 | 10.2.0.4 | 10.10.1.1 | -6792121109870000602 | 4598 | 2021-03-10 07:18:15.774315+00
7 | 10.2.0.5 | 10.10.1.1 | -6792121109870000602 | 4598 | 2021-03-10 07:18:15.774315+00
7 | 10.2.0.6 | 10.10.1.1 | -6792121109870000602 | 4598 | 2021-03-10 07:18:15.774315+00
7 | 10.2.0.7 | 10.10.1.1 | -6792121109870000602 | 4917 | 2021-03-10 07:18:47.051861+00
7 | 10.2.0.8 | 10.10.1.1 | -6792121109870000602 | 4917 | 2021-03-10 07:18:47.051861+00
7 | 10.2.0.9 | 10.10.1.1 | -6792121109870000602 | 4917 | 2021-03-10 07:18:47.051861+00
7 | 10.2.0.10 | 10.10.1.1 | -6792121109870000602 | 5767 | 2021-03-10 07:18:58.3956+00
7 | 10.2.0.11 | 10.10.1.1 | -6792121109870000602 | 5767 | 2021-03-10 07:18:58.3956+00
7 | 10.2.0.12 | 10.10.1.1 | -6792121109870000602 | 5767 | 2021-03-10 07:18:58.3956+00
(45 rows)
Truncating the running_repairs table unblocked the repairs and some segments seem to be completing, but I'm still seeing some errors in the logs:
INFO [2021-03-10 07:35:04,697] [x:00000000-0000-0007-0000-000000000000:00000000-0000-1910-0000-000000000000] i.c.s.PostgresStorage - Failed to take lead on node x.x.x.x for segment 00000000-0000-1910-0000-000000000000
INFO [2021-03-10 07:35:01,010] [x:00000000-0000-000d-0000-000000000000:00000000-0000-8445-0000-000000000000] i.c.s.PostgresStorage - Failed renewing lead on segment 00000000-0000-8445-0000-000000000000 for node x.x.x.x
ERROR [2021-03-10 07:36:29,868] [x:00000000-0000-0007-0000-000000000000:00000000-0000-190e-0000-000000000000] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: non-leader trying to update repair segment 00000000-0000-190e-0000-000000000000 of run 00000000-0000-0007-0000-000000000000
at io.cassandrareaper.storage.PostgresStorage.updateRepairSegment(PostgresStorage.java:419)
at io.cassandrareaper.service.SegmentRunner.postpone(SegmentRunner.java:198)
at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:213)
at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:226)
at io.cassandrareaper.service.SegmentRunner.processTriggeredSegment(SegmentRunner.java:415)
at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:315)
at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:161)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:117)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:38)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The error messages could be related to the truncation, since we removed the existing locks. Still, blocked repairs like the ones you've seen should never happen and need to be thoroughly investigated.
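To illustrate why the truncation would produce those messages (a sketch of the locking idea only, not the exact statement PostgresStorage runs; the values are taken from the table above): renewing a lead presumably only succeeds if the row is still owned by this Reaper instance, so once the rows are gone the update matches nothing and the renewal is reported as failed.

-- Illustration: renew the lead only if this instance still holds the row
UPDATE running_repairs
   SET last_heartbeat = now()
 WHERE segment_id = 1014
   AND node = '10.0.0.7'
   AND reaper_instance_id = -7852674672088628830;
-- after the truncate this updates 0 rows, hence "Failed renewing lead"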
The java.lang.AssertionError shows up in the logs every time the last segment in the repair is retried:
INFO [2021-03-10 09:04:57,090] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.SegmentRunner - Repair command 20565 on segment 00000000-0000-03f6-0000-000000000000 returned with state NOT_STARTED
INFO [2021-03-10 09:04:57,090] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.SegmentRunner - Repair command 20565 on segment 00000000-0000-03f6-0000-000000000000 never managed to start within timeout.
INFO [2021-03-10 09:04:57,090] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.SegmentRunner - Postponing segment 00000000-0000-03f6-0000-000000000000
INFO [2021-03-10 09:04:57,091] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.PostgresStorage - Failed renewing lead on segment 00000000-0000-03f6-0000-000000000000 for node x.x.x.x
ERROR [2021-03-10 09:04:57,091] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.PostgresStorage - Failed to renew lead on segment 00000000-0000-0003-0000-000000000000
ERROR [2021-03-10 09:04:57,092] [x:00000000-0000-0003-0000-000000000000:00000000-0000-03f6-0000-000000000000] i.c.s.RepairRunner - Executing SegmentRunner failed
java.lang.AssertionError: non-leader trying to update repair segment 00000000-0000-03f6-0000-000000000000 of run 00000000-0000-0003-0000-000000000000
at io.cassandrareaper.storage.PostgresStorage.updateRepairSegment(PostgresStorage.java:419)
at io.cassandrareaper.service.SegmentRunner.postpone(SegmentRunner.java:198)
at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:213)
at io.cassandrareaper.service.SegmentRunner.abort(SegmentRunner.java:226)
at io.cassandrareaper.service.SegmentRunner.processTriggeredSegment(SegmentRunner.java:415)
at io.cassandrareaper.service.SegmentRunner.runRepair(SegmentRunner.java:315)
at io.cassandrareaper.service.SegmentRunner.run(SegmentRunner.java:161)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:117)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:38)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
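For anyone checking this on their own cluster, a minimal sketch of the kind of query that could show whether a leftover lead row is still parked on the stuck segment's replicas (the addresses below are placeholders, reusing the obfuscated ones from the table above):

-- Sketch: look for leftover lead rows on the replicas of the stuck segment
SELECT repair_id, node, reaper_instance_id, segment_id, last_heartbeat
  FROM running_repairs
 WHERE node IN ('10.0.0.1', '10.0.0.2', '10.0.0.3');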
Hi, I'm facing the same issue. What information could aid in resolving this ticket?
Thanks in advance
Hi folks, is this still an issue you're facing?
Yes, it is. I've not been able to find a solution for it.
@LarsinYousif, the postgres backend has been removed in Reaper 3.x and it is highly unlikely that we'll come up with a fix for it. Is there any chance you could move to the Cassandra backend?