cassandra icon indicating copy to clipboard operation
cassandra copied to clipboard

CEP-37 on Trunk

Open jaydeepkumar1984 opened this issue 1 year ago • 1 comments

CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution

Design doc: https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit?tab=t.0#heading=h.r112r46toau0

jaydeepkumar1984 avatar Oct 03 '24 20:10 jaydeepkumar1984

@masokol , @emolsson , @itskarlsson , @tommystendahl , @etedpet , @jwaeab , @VictorCavichioli , @SajidRiaz138 , @ch1bbe , @ArcturusMengsk , @DanielwEriksson , @manmagic3

as contributors to https://github.com/Ericsson/ecchronos we would very much appreciate any last-minute-pre-CEP-vote review to this PR for Cassandra's new repair solution. (more technical/code review will continue post-CEP vote.)

CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution

Design doc: https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit?tab=t.0#heading=h.r112r46toau0

michaelsembwever avatar Oct 19 '24 15:10 michaelsembwever

Hi,

I think this looks promising! I have a few points:

  • Combine ranges instead of splitting - in ecChronos we saw huge improvements in some scenarios (when the data is low or empty tables) compared to repairing 1 vnode at a time. This improvement scaled with amount of vnodes, although it might've been related to overhead due to running repairs through JMX. Not sure but might be worth investigating.
  • Major versions, during major version upgrades like 3 -> 4 we weren't supposed to run repairs. If Cassandra plans to keep this then it would be nice for repairs to automatically pause during major version upgrades.
  • Observabliity, i saw there're metrics but it would also be nice to see repair status with nodetool.
  • Repair priority per table, not per node.

masokol avatar Oct 21 '24 08:10 masokol

Thanks, @masokol, for the review! Please find my response here:

  • Combine ranges instead of splitting - in ecChronos we saw huge improvements in some scenarios (when the data is low or empty tables) compared to repairing 1 vnode at a time. This improvement scaled with amount of vnodes, although it might've been related to overhead due to running repairs through JMX. Not sure but might be worth investigating.
  1. Generally, empty tables or tables with a small amount of data runs through pretty fast, in seconds, so it is not a major issue for smaller/empty tables.
  2. The current framework already has support to combine ranges through a setting (src/java/org/apache/cassandra/repair/autorepair/AutoRepair.java:266)
  • Major versions, during major version upgrades like 3 -> 4 we weren't supposed to run repairs. If Cassandra plans to keep this then it would be nice for repairs to automatically pause during major version upgrades.

This is a great suggestion, and the framework can be enhanced easily - I just filed a new sub-ticket CASSANDRA-20013 to track this

  • Observabliity, i saw there're metrics but it would also be nice to see repair status with nodetool.

There is already a new nodetool command that would print the current status. Please take a look at it here.

  • Repair priority per table, not per node.

Currently, it will repair the tables randomly, but it can be enhanced to add a priority as a CQL table property that an end user can configure, which can also be enhanced easily. Just added this enhancement to the ticket mentioned above.

jaydeepkumar1984 avatar Oct 21 '24 16:10 jaydeepkumar1984

There are a few nits that I have added, but overall it looks good now, I am approving this PR from my side 👍

jaydeepkumar1984 avatar Feb 13 '25 20:02 jaydeepkumar1984

I a bit of time reviewing documentation and configuration today and have created CASSANDRA-20421 and https://github.com/jaydeepkumar1984/cassandra/pull/47 to improve the docs, configuration and to make the default configuration more conservative.

tolbertam avatar Mar 08 '25 07:03 tolbertam

Rebase on top of the latest trunk has been completed; all the unit tests are passing on CircleCI

Many dtests are failing on CircleCI due to resource limitations. For that, I have run them internally inside our infrastructure and most of them are passing as expected.

jaydeepkumar1984 avatar Apr 20 '25 17:04 jaydeepkumar1984