
Incremental Repair Stalls on Cassandra 4

Open · ReillyBrogan opened this issue · 3 comments

We recently upgraded our Cassandra clusters to 4.0.4 (from 3.11). We attempted to switch our repair schedules over to incremental repair, since it is now the recommended repair method in 4.0; however, incremental repairs appear to stall on Cassandra 4.0 in our environment.

The main symptom is that repairs proceed as normal until they just "stop", with no apparent error message in either Cassandra or Reaper. These stalled repairs persist indefinitely until they are manually aborted in Reaper.

Our environment:

  • Cassandra 4.0.4 running under the eclipse-temurin 11.0.15_10-jre JVM (Ubuntu Jammy container)
  • Reaper 3.1.1 running as a sidecar (also tested with a container built from git master)
  • Three nodes spread across different AWS AZs (but configured to be part of the same Cassandra datacenter)

Repairs were created with the following settings (a hedged sketch of an equivalent Reaper REST call follows the list):

  • Targeting a single keyspace (glowroot for our example)
  • Incremental TRUE
  • Adaptive TRUE (not sure if this does anything with incremental repairs)
  • Concurrency Parallel
  • Thread count 4
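
For reference, here is a minimal sketch of creating a run with these settings through Reaper's REST API (POST /repair_run), using the Java 11 HTTP client. The parameter names (clusterName, keyspace, owner, incrementalRepair, repairParallelism, repairThreadCount) reflect my reading of Reaper's v1 API and may differ between Reaper versions; the Reaper host, cluster name, and owner below are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch only: endpoint and parameter names are assumptions based on
// Reaper's v1 REST API and may not match every Reaper version exactly.
public class CreateIncrementalRepairRun {
  public static void main(String[] args) throws Exception {
    String query = String.join("&",
        "clusterName=cassandra",      // cluster name as registered in Reaper (placeholder)
        "keyspace=glowroot",          // single target keyspace from this report
        "owner=ops",                  // placeholder owner
        "incrementalRepair=true",     // Incremental TRUE
        "repairParallelism=parallel", // Concurrency: Parallel
        "repairThreadCount=4");       // Thread count 4

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://reaper:8080/repair_run?" + query)) // placeholder host
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```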

Cassandra system logs (these start just before the repair was started and run until roughly thirty minutes after the last repair-related entry, at which point we assumed the repair had stalled, as we have never seen a repair recover after that point): cassandra-0.log cassandra-1.log cassandra-2.log

Reaper logs are essentially just the following on every Reaper container (repeated until the repair is canceled), with no actual errors:

INFO   [2022-07-08 20:55:33,093] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-07-08 20:55:33,252] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Next segment to run : 4ed04e40-ff00-11ec-a134-1b2df3c07864 
INFO   [2022-07-08 20:56:12,181] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-07-08 20:56:12,341] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Next segment to run : 4ed07550-ff00-11ec-a134-1b2df3c07864 
INFO   [2022-07-08 20:56:32,667] [SchedulingManagerTimer] i.c.s.SchedulingManager - Repair schedule '5147f450-fe58-11ec-97bf-51955d7aec44' is paused 
INFO   [2022-07-08 20:56:47,678] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-07-08 20:56:47,775] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Next segment to run : 4ed07550-ff00-11ec-a134-1b2df3c07864 
INFO   [2022-07-08 20:57:26,639] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-07-08 20:57:26,764] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Next segment to run : 4ed04e40-ff00-11ec-a134-1b2df3c07864 
INFO   [2022-07-08 20:57:32,741] [SchedulingManagerTimer] i.c.s.SchedulingManager - Repair schedule '5147f450-fe58-11ec-97bf-51955d7aec44' is paused 
INFO   [2022-07-08 20:58:06,369] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-07-08 20:58:06,439] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Next segment to run : 4ed04e40-ff00-11ec-a134-1b2df3c07864 
INFO   [2022-07-08 20:58:32,667] [SchedulingManagerTimer] i.c.s.SchedulingManager - Repair schedule '5147f450-fe58-11ec-97bf-51955d7aec44' is paused 
INFO   [2022-07-08 20:58:45,853] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2022-07-08 20:58:45,864] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Next segment to run : 4ed07550-ff00-11ec-a134-1b2df3c07864 
INFO   [2022-07-08 20:59:19,979] [cassandra:4ecfb200-ff00-11ec-a134-1b2df3c07864] i.c.s.RepairRunner - Attempting to run new segment... 

Nodetool info for each node

Pod cassandra-0
ID                     : 63f8aa38-06b4-44cd-92a2-37e7074e0455
Gossip active          : true
Native Transport active: true
Load                   : 1.89 GiB
Generation No          : 1657313463
Uptime (seconds)       : 3813
Heap Memory (MB)       : 476.88 / 5120.00
Off Heap Memory (MB)   : 208.66
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 28550, size 4.37 MiB, capacity 100 MiB, 236768 hits, 272199 requests, 0.870 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Percent Repaired       : 89.51556470655855%
Token                  : (invoke with -T/--tokens to see all 256 tokens)

Pod cassandra-1
ID                     : 22809334-476a-443a-927a-caeadde6147f
Gossip active          : true
Native Transport active: true
Load                   : 1.9 GiB
Generation No          : 1657313162
Uptime (seconds)       : 4068
Heap Memory (MB)       : 3115.04 / 5120.00
Off Heap Memory (MB)   : 88.50
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 28357, size 4.36 MiB, capacity 100 MiB, 305962 hits, 337363 requests, 0.907 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Percent Repaired       : 89.70026130667003%
Token                  : (invoke with -T/--tokens to see all 256 tokens)

Pod cassandra-2
ID                     : a14dc982-362d-4c68-90d1-cb7915a6f8df
Gossip active          : true
Native Transport active: true
Load                   : 1.9 GiB
Generation No          : 1657312920
Uptime (seconds)       : 4322
Heap Memory (MB)       : 2890.04 / 5120.00
Off Heap Memory (MB)   : 102.07
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 29388, size 4.46 MiB, capacity 100 MiB, 298952 hits, 331132 requests, 0.903 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Percent Repaired       : 90.64904204290283%
Token                  : (invoke with -T/--tokens to see all 256 tokens)

nodetool compactionstats and nodetool netstats output appears normal to me. nodetool repair_admin shows the repair as "Running", with the last activity recorded not long after the repair started.


ReillyBrogan (Jul 08 '22 22:07)

Oh, I should note that full repairs do not appear to have any issues and complete just as they did on 3.11. Incremental repairs do seem to work for a while if all tables in the keyspace are rebuilt with nodetool scrub or nodetool upgradesstables, but the issue crops up again after a while (and obviously those are not sustainable workarounds).

ReillyBrogan (Jul 08 '22 22:07)

I was able to reproduce the issue in K8ssandra with pods changing IPs after a restart. In the case of incremental repair, we do not re-compute the coordinator IP for each segment as we do for full repairs: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/server/src/main/java/io/cassandrareaper/service/RepairRunner.java#L469-L471

Instead, we use the coordinator host that was computed when the repair run was created. This should hopefully be easy enough to fix.

This would mean that scrubbing/upgrading sstables isn't what actually fixes the problem; creating a new repair run does, for now, because the coordinator host is recomputed when the run is created.
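
To illustrate the idea (this is not Reaper's actual code, and every name below is hypothetical): the fix amounts to resolving the coordinator endpoint from current cluster state each time a segment is submitted, rather than reusing the address cached when the run was created.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, not Reaper's RepairRunner: it only illustrates looking up the
// coordinator per segment (host id -> current IP) so that a pod restart which changes
// a node's IP does not leave the run pointing at a stale address.
public class CoordinatorPerSegmentSketch {

  // Stand-in for a repair segment, keyed by the host id that should coordinate it.
  record Segment(String coordinatorHostId) {}

  // Resolve the coordinator just before submitting the segment, falling back to the
  // cached address only if the host id is no longer present in cluster state.
  static String coordinatorFor(Segment segment,
                               Function<String, Optional<String>> currentIpByHostId,
                               String cachedCoordinatorIp) {
    return currentIpByHostId.apply(segment.coordinatorHostId())
        .orElse(cachedCoordinatorIp);
  }

  public static void main(String[] args) {
    // Pretend cluster state after a pod restart changed cassandra-0's IP (made-up IPs).
    Map<String, Optional<String>> state = Map.of(
        "63f8aa38-06b4-44cd-92a2-37e7074e0455", Optional.of("10.0.2.41"));
    Segment seg = new Segment("63f8aa38-06b4-44cd-92a2-37e7074e0455");
    System.out.println(coordinatorFor(
        seg, hostId -> state.getOrDefault(hostId, Optional.empty()), "10.0.2.7"));
  }
}
```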

adejanovski (Jul 21 '22 13:07)

It's likely, then, that I misunderstood what was fixing the issue. I can also help test a PR for this if you or someone else creates one.

ReillyBrogan (Jul 22 '22 17:07)