cassandra icon indicating copy to clipboard operation
cassandra copied to clipboard

DSP-23003 Port batchlog_endpoint_strategy config

Open shunsaker opened this issue 1 year ago • 2 comments

Port bug: DB-1367 - Add batchlog_endpoint_strategy config to improve batchlog endpoint selection

Jira https://datastax.jira.com/browse/DSP-23003

Compare to original DSE commit https://github.com/jasonstack/apollo/commit/29fecd1f49d2bb539da573786f93fd9b0ba205c4#diff-03e0ee2c01f2ca4d3b15d3a1fc4917d8737861c15d8c83cb73b352fe9da22d1a

This change set is different from that patch in that the changes take place in ReplicaPlans.java instead of batchlogManger.java to simplify future merge conflicts.

Adds original batchlog_endpoint_strategy options:

  • random_remote (default)
  • dynamic_remote
  • dynamic

while also keeping compatibility with Allow to filter batchlog endpoints by preferring the local rack via existing perferLocal boolean parameter and also a new batchlog_endpoint_strategy: prefer_local

shunsaker avatar Apr 09 '24 13:04 shunsaker

Confirmed failures are false alarms, see cassci bot comment https://github.com/datastax/cassandra/commit/09acaf9e026b657999cda4ca51cc1a4f6a2e20f0#commitcomment-140825091

shunsaker avatar Apr 15 '24 15:04 shunsaker

@djatnieks thanks for the reivew, I appreciate the nits! I'll probably get back to this next week. We've discussed trying to get this into OSS, but it seems complicated. They have an open ticket that hasn't had any activity in a year https://issues.apache.org/jira/browse/CASSANDRA-18120. The main hurdle is probably the fact that our solution uses DynamicSnitch and it's common practice to disable dynamic snitching, so it might not be a beneficial solution for OSS

shunsaker avatar May 01 '24 19:05 shunsaker

LGTM.

A few minor concerns:

  • the dse. property prefix (i don't have an answer or concrete opinion to this though), see comment,
  • have we done multiplexing tests on this ?
  • have you tried rebasing this patch to cassandra's trunk, and cassandra-5.0 ? (I don't see TCM touching any of this stuff, but i'd want to double-check that – it might influence how we/you want the patch done in earlier branches…)

michaelsembwever avatar May 23 '24 18:05 michaelsembwever

Thanks @michaelsembwever! Planning on getting this ready for merge today. I'll rebase, update CassandraRelevantProperties, rename the property, and double check it applies cleanly to cassandra's trunk, and cassandra-5.0. I'm not aware of any multiplexing tests. With this being my first CC PR I'm not sure which tests I need to run before merging

shunsaker avatar May 28 '24 13:05 shunsaker

FYI: you have some conflicts to fix in your branch as it has conflicts with the main branch

roxananeo avatar May 30 '24 11:05 roxananeo

Pushed to address comments

  1. Rebased and resolved conflicts. PR was originally developed on ds-trunk instead of vsearch...
  2. Added REQUIRED_BATCHLOG_REPLICA_COUNT to CassandraRelevantProperties
  3. and renamed the property from dse.batchlog.required_replica_count to cassandra.batchlog.required_replica_count
  4. renamed BatchlogEndpointStrategy.dynamicSnitch to BatchlogEndpointStrategy.useDynamicSnitchScores
  5. tweaked BatchlogEndpointStrategy comments in config.java

shunsaker avatar May 30 '24 13:05 shunsaker

Latest run shows 1 new test failure https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-1080/10/testReport/ Waiting for rerun to complete before merging https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-1080/11/

shunsaker avatar May 30 '24 18:05 shunsaker

Run 11 completed with no failures

shunsaker avatar May 30 '24 19:05 shunsaker

Thanks Shayne!

jeromatron avatar May 30 '24 19:05 jeromatron