
Feature zero token nodes

Open aleksbykov opened this issue 1 year ago • 1 comment

The new feature, zero token nodes, was introduced by https://github.com/scylladb/scylladb/pull/19684. These nodes are normal Scylla nodes, but they don't join the cluster's token ring and don't contain any user data. They are used by raft and group0 to preserve majority for raft operations. Zero token nodes can be created only at bootstrap, if scylla.yaml contains join_ring: false. Zero token nodes are normal Scylla nodes and all operations are allowed with them (remove, decommission, replace — with zero token nodes only). Their main goals:

  • preserve quorum and majority in a multi-dc configuration. Example: dc1: 3, dc2: 3 — if one dc dies or becomes isolated, both dcs lose majority and no topology or schema operation is allowed; if one more dc with a single zero token node is added (dc3: 1), majority is preserved.
  • because zero token nodes stay group0 voters, even a single-dc cluster of 4 nodes + 1 zero token node can keep cluster majority: even if 2 nodes die, schema changes and topology operations are still allowed.

A zero token node doesn't contain user data, therefore a smaller instance type can be used. Drivers don't see zero token nodes and ignore them.
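For illustration, a minimal sketch of what this boils down to on the configuration side (the helper name is hypothetical; the only setting that comes from the feature itself is join_ring: false, which must be in place before the node's first boot):

```python
import yaml

def make_zero_token_config(scylla_yaml_path: str) -> None:
    """Rewrite scylla.yaml so the node bootstraps outside the token ring."""
    with open(scylla_yaml_path) as f:
        config = yaml.safe_load(f) or {}
    config["join_ring"] = False  # effective only if set before the first boot
    with open(scylla_yaml_path, "w") as f:
        yaml.safe_dump(config, f)
```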

To add support for zero token nodes in sct:

  • [ ] Add a special flag to the Node class which will be true if the node is a zero token node

  • [ ] Add new sct_config parameters: the number of zero token nodes, and an additional dc with zero token nodes only

  • [ ] Init the cluster with zero token nodes. Here we have 2 options (see the sketch after this checklist):

      1. [ ] Using the new parameter, mark the last num_zero_token nodes in the list as zero token nodes. This solution has several benefits:
          • no need to change the provisioner code, so all backends are supported (aws, gce, docker, k8s??)
          • if num_zero_tokens = 0, nothing changes
          • simple to implement: just add a new test with the correct yaml configuration
        and disadvantages:
          • zero token nodes will have the same instance type as normal db nodes — as long as a test doesn't have a lot of them, this is not a problem
          • if even 1 zero token node is enabled by default in the config, all config yaml files have to be reviewed so that the number of normal nodes matches the RF of the schemas: if a test is configured with 3 nodes, one of which is a zero token node, and rf = 3, then after provisioning the cluster will have 2 normal nodes and 1 zero token node. This could affect c-s and other stress tools.
      2. [ ] Add code to the Provisioner (aws, azure) and to gce, docker, k8s(??) to provision the correct number of nodes: n_db_nodes + nemesis_add_nodes + num_zero_token_nodes. Also add a new sct config parameter that allows setting a small instance type for zero token nodes. This solution has the following benefits:
          • the instance type can be set independently for normal nodes and zero token nodes
          • no need to review and fix all test config yamls
        and disadvantages:
          • quite a complex code refactor: the Provisioner and all related code need fixing, plus new code for gce and docker (we don't have a provisioner for them yet, am I right?)
          • additional checks are needed for the instance type, so that a small instance is not set for a normal node
          • others, not yet found :)
  • [ ] Add support for cluster operations and disrupt nemeses:

      • bootstrap a new node as a zero token node
      • decommission a zero token node and add a new zero token node
      • replace a zero token node with a zero token node only
      • remove a zero token node; if a new node is added after that, it should be a zero token node
  • [ ] Cluster status and health. Currently nodetool status doesn't display zero token nodes: https://github.com/scylladb/scylladb/issues/19849. Once it is fixed, all zero token nodes will be displayed in the output like regular nodes. Until then, a workaround is needed so that the cluster health check and nodetool status work correctly and don't fail the test.

  • [ ] New nemeses:

    • [ ] Create a new dc with zero token nodes only (the current nemesis could be improved for this)
    • [ ] Multi-dc (2 dcs): create a new dc with zero token nodes only, stop all nodes in one dc, and check that majority is not lost (another variation: 3 dcs are initiated, and the nemesis checks that there are 3 dcs and one of them has only zero token nodes, and only then runs; otherwise it is skipped)
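As a rough sketch of the first initialization option above (all names are illustrative, not actual SCT code): the last num_zero_token_nodes entries of the provisioned node list are flagged, and only their scylla.yaml differs from a normal db node's:

```python
def mark_zero_token_nodes(nodes: list, num_zero_token_nodes: int) -> list:
    """Mark the last N provisioned nodes as zero token nodes.

    With num_zero_token_nodes == 0 the cluster is left unchanged,
    and no provisioner code has to be touched on any backend.
    """
    if num_zero_token_nodes <= 0:
        return []
    zero_nodes = nodes[-num_zero_token_nodes:]
    for node in zero_nodes:
        node.is_zero_token = True  # hypothetical Node-class flag
        # each flagged node's scylla.yaml then gets join_ring: false
        # before its first start
    return zero_nodes
```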

Testing

The PR already contains the first way of cluster initialization. The job is done and a node with zero token is added, but due to https://github.com/scylladb/scylladb/issues/19849 the cluster health check fails and the job is marked as failed.

PR pre-checks (self review)

  • [ ] I added the relevant backport labels
  • [ ] I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

aleksbykov avatar Sep 03 '24 09:09 aleksbykov

@soyacz @fruch @roydahan @vponomaryov @temichus I need your opinion on the implementation.

aleksbykov avatar Sep 03 '24 09:09 aleksbykov

New changes are ready:

  • Removed unused sct_config variables. Now only 2 variables are used: n_db_zero_token_nodes and zero_token_instance_type_db. n_db_zero_token_nodes has the same format as n_db_nodes: an int, or a string like "0 0 1", which means we should have 3 dcs and a znode only in the 3rd one (a parsing sketch follows this list). zero_token_instance_type_db is the instance type of znodes; if it is not set, db_instance_type is used.
  • Added functionality to the sct aws provisioner to create instances for znodes with the specified instance type. A znode instance gets the tag ZeroTokenNode=True, which lets sct detect the znode and correctly configure it in scylla.yaml.
  • sct (aws part only) can detect znodes and mark them with the new property BaseNode._zero_token_node. We can also create znodes with the aws cluster (an appropriate parameter was added to the add_nodes() method).
  • BaseScyllaCluster has 2 new properties: data_nodes and zero_nodes. Each returns the list of relevant nodes. BaseScyllaCluster.nodes was not renamed, because that would require a lot of changes in the sct code; it still contains the full list of nodes (data nodes and znodes). This also reflects that znodes are regular scylla nodes (without data) and most operations can be started from them.
  • Nemesis.get_target_node is switched for now to pick the target node only from the data node pool.
  • Updating all nemeses and sct operations to use the specified pool of nodes is in progress.
  • Added 2 new nemeses:
    • bootstrap/decommission a znode (single DC and multi DC)
    • add/remove a new dc with znodes only (single DC; multi DC is in progress)
  • Added a workaround for health_validator and wait_node_up to get node status from the system.cluster_status table instead of nodetool status.
  • Added 2 new jobs, single dc and multi dc, with znodes configured.
    • both jobs passed with regular nemeses
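A small sketch of how the n_db_zero_token_nodes value could be interpreted (an illustrative helper mirroring the described n_db_nodes format, not the actual sct_config code):

```python
def parse_per_dc_counts(value) -> list[int]:
    """Interpret n_db_zero_token_nodes the same way as n_db_nodes.

    4        -> [4]        (single dc with 4 zero token nodes)
    "0 0 1"  -> [0, 0, 1]  (3 dcs, a single znode only in the 3rd dc)
    """
    if isinstance(value, int):
        return [value]
    return [int(count) for count in str(value).split()]
```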

aleksbykov avatar Oct 10 '24 07:10 aleksbykov

I would recommend adding issues for following up on other backends; we can help with docker/gce/azure, and the cloud/operator team with the k8s backend — they will need it as well at some point.

fruch avatar Oct 10 '24 18:10 fruch

Fixed according to the comments. New:

  • Added a new global flag 'use_zero_nodes' which allows enabling/disabling support of zero nodes in sct. If it is disabled, sct will configure and use only data nodes, all nemeses related to zero nodes only will be skipped, and all nemeses which can work with any nodes will work only with data_nodes.
  • The Nemesis class has a new attribute 'target_node_pool' and a method set_target_node_pool. This allows explicitly configuring the pool of nodes from which the nemesis target node is chosen. Several nemeses are marked with a decorator which explicitly sets target_node_pool; by default data_nodes is the target_node_pool (a sketch follows this list).
  • Marked several disrupt_* methods to use the appropriate pool.
  • Went through the sct code and set it to use data_nodes instead of all nodes (self.nodes).
  • Added the appropriate flag to the zero-node-related yamls and jobs.
  • 2 testing jobs were added and will be used for testing zero nodes; all other jobs will continue to work without zero nodes for a while.
  • Added a new nemesis flag 'zero_node_changes' which allows selecting nemeses that can work with zero nodes.
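A minimal sketch of the decorator idea (the decorator name and pool handling are assumptions; only target_node_pool and set_target_node_pool come from the description above):

```python
from functools import wraps

def target_all_nodes(func):
    """Let a disrupt_* method pick its target from the full node list,
    including zero token nodes, instead of the data_nodes default."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        previous_pool = self.target_node_pool
        self.set_target_node_pool("nodes")  # data nodes + zero token nodes
        try:
            return func(self, *args, **kwargs)
        finally:
            self.set_target_node_pool(previous_pool)  # restore the default
    return wrapper
```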

aleksbykov avatar Oct 25 '24 17:10 aleksbykov

@patjed41 @soyacz , can you review again?

aleksbykov avatar Oct 25 '24 17:10 aleksbykov

@soyacz , fixed per your comments. Can you take a look?

aleksbykov avatar Oct 28 '24 17:10 aleksbykov

I took a look at risky places (possibly not all of them) and got to this:

to fix:

  • sdcm.cluster.BaseCluster.run_node_benchmarks - should be run on data nodes

Fixed

  • sdcm.cluster.BaseScyllaCluster.get_rack_nodes - used in k8s context, replaces db node with data node.

I left it as is, because k8s is not supported yet. I think it is better to fix this when k8s support for zero nodes is added. For now, for k8s, self.nodes == self.data_nodes.

  • longevity_test.LongevityTest.test_custom_time - in case cluster_target_size is set, it shouldn't take zero nodes into account, I believe

Set to use data_nodes for counting

  • places where we use adaptive_timeout - should operate on data nodes; fix places where we used cluster.nodes[0]

fixed

  • longevity_test.LongevityTest._run_validate_large_collections_in_system and longevity_test.LongevityTest._run_validate_large_collections_warning_in_logs should operate on data nodes I think

Fixed.

  • longevity_test.LongevityTest.all_node_ips_for_stress_command - also should operate on data nodes I think

zero nodes can be query coordinators, and the driver filters them out, but set to data_nodes for safety

  • longevity_tombstone_gc_test.TombstoneGcLongevityTest._run_repair_and_major_compaction - please verify

switch to data_nodes

  • sdcm.cluster.BaseScyllaCluster.start_kms_key_rotation_thread should be adjusted to use data nodes I think

it is related to scylla config, so it is better to run on all nodes

  • sdcm.nemesis.Nemesis.disrupt_load_and_stream - verify

Fixed, set to use data nodes explicitly

  • SstableLoadUtils.distribute_test_files_to_cluster_nodes

It uses the passed node list, so fix above should also fix this one

  • disrupt_nodetool_refresh - shards_num = self.cluster.nodes[0].scylla_shards can take a zero node and return a wrong value; also I don't think this disruption should upload sstables to all nodes

fixed

  • disrupt_nodetool_enospc - this nemesis might not work with zeros; the same for disrupt_end_of_quota_nemesis

set data_nodes

  • check_that_sstables_are_encrypted - this will fail

Set to use only data nodes; it should work. I will trigger a job to check it.

  • _switch_to_network_replication_strategy - nodes_by_region takes zero nodes into account

set to use data nodes

  • disrupt_add_remove_dc - also needs fixing in a couple of places

updated with data nodes

  • disrupt_add_remove_mv - should use data nodes as target I think (btw. will sdcm.utils.nemesis_utils.indexes.wait_for_view_to_be_built work?)

yes, it should work

  • sdcm/nemesis.py:5302 - nemesis wrapper - we verify if num_nodes_before != num_nodes_after; this should verify that we have the same numbers of zero nodes and data nodes, I think

added checks for both

  • sdcm.scan_operation_thread.FullscanOperationBase should possibly also be based on data nodes

yep, fixed

  • sdcm.tombstone_gc_verification_thread.TombstoneGcVerificationThread - also should work with data nodes (validates sstables)

set to data_nodes

questions:

  1. Generally, shouldn't all db cql sessions be created with data nodes?

No, a zero node can be a query coordinator; it sends the query request to the required data node.

  2. Shouldn't all stress threads be based on data nodes too?

Same as above. Drivers will automatically ignore zero nodes even if they are set in -nodes. The only limitation is for multi-dc: a dc with zero nodes only must have replication factor 0 set explicitly. Automatic rf adjustment is not working yet; an issue was opened for that.
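Under that constraint, pinning the zero-token-only dc to rf 0 could look like this (keyspace and dc names are examples only):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()
# dc3 holds only zero token nodes, so it is pinned to rf 0 explicitly
# until automatic rf adjustment is implemented.
session.execute("""
    ALTER KEYSPACE ks WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3, 'dc2': 3, 'dc3': 0
    }
""")
```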

  3. sdcm.cluster.BaseScyllaCluster.set_seeds - can a zero node be a seed?

Yes. The only limitation for a zero node is that it must not be the first node in the cluster, but in sct zero nodes are added last.

  4. sdcm.cluster.BaseScyllaCluster.wait_for_schema_agreement - is the schema agreement check the same on zero nodes?

Yes, the schema should be the same on all nodes; this is guaranteed by raft.

  5. sdcm.cluster.BaseScyllaCluster.wait_all_nodes_un uses check_nodes_up_and_normal, which is based on nodetool status - will it work?

I added a workaround while the issue is not fixed: status is read from system tables, generating the same dict as for nodetool status.
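A sketch of the idea behind that workaround, assuming the system.cluster_status virtual table exposes per-peer up/status columns (exact column names may differ):

```python
from cassandra.cluster import Cluster

def status_from_system_table(contact_point: str) -> dict:
    """Build a nodetool-status-like mapping from system.cluster_status,
    so zero token nodes are reported too while scylladb#19849 is open."""
    session = Cluster([contact_point]).connect()
    rows = session.execute(
        "SELECT peer, dc, status, up FROM system.cluster_status")
    return {
        str(row.peer): {"dc": row.dc,
                        "up": "U" if row.up else "D",
                        "status": row.status}
        for row in rows
    }
```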

  6. sdcm.cluster.BaseScyllaCluster.verify_decommission also uses nodetool status under the hood

same

  7. sdcm.cluster.BaseScyllaCluster.get_db_nodes_cpu_mode - is the perftune yaml always present on zero nodes?

For now zero nodes always run on instance types supported by scylla, and perftune is present. This could change later.

  8. sdcm/commit_log_check_thread.py - verify how zero nodes should be counted here, possibly the same, but I'm not sure.

Looks the same.

  9. Do we need the scylla manager agent on zero nodes?

An issue was opened for scylla manager zero node support. But I think yes.

  10. Do we need special treatment of zero nodes in upgrade tests? Support in upgrade tests can be done in a follow-up task.

It will be a next task. From the upgrade perspective nothing should change; a zero node is upgraded as usual.

  11. Does nodetool cleanup work on zeros?

Yes, it finishes very fast.

  12. Won't disrupt_validate_hh_short_downtime fail when validating UN with nodetool on all nodes?

The workaround should fix it.

  13. Won't sdcm.cluster.BaseScyllaCluster.get_node_status_dictionary fail if a zero node is used?

The workaround should help.

  14. sdcm.utils.compaction_ops.CompactionOps - should work with data nodes

switched

to check:

  1. sdcm.cluster.BaseScyllaCluster.get_node - it is used for nosqlbench as a target and in other places; see if it shouldn't be pinned to data nodes

Set to use data_nodes

  2. sdcm.kafka.kafka_cluster.LocalKafkaCluster - doesn't it break the kafka connector?

it shouldn't.

  3. Validate SLA tests (possibly another follow-up task)

Yes, it will be another task; for now Enterprise 2024.2 does not support zero nodes yet.

  4. The same for alternator tests (possibly another follow-up task)

Yes, it will be checked in another task by enabling the flag for the appropriate job.

aleksbykov avatar Oct 30 '24 09:10 aleksbykov

  • sdcm.cluster.BaseScyllaCluster.start_kms_key_rotation_thread should be adjusted to use data nodes I think

it is related to scylla config, so it is better to run on all nodes

but inside it verifies sstables of user keyspaces - this will fail. This part needs to be adjusted.

to check:

  1. sdcm.cluster.BaseScyllaCluster.get_node - it is used for nosqlbench as a target and in other places; see if it shouldn't be pinned to data nodes

Set to use data_nodes

In that case, if drivers ignore zero nodes, this change is not required

  1. sdcm.kafka.kafka_cluster.LocalKafkaCluster - doesn't it break the kafka connector?

it shouldn't.

Worth scheduling a test for it (it can be done later; raise a scylla issue if it fails)

More notes: I didn't verify if you indeed fixed all my noted places - too hard, so make sure you pushed all changes. See where we use nodetool status and adjust the places that use it - I might have missed something. You can also double-check all the places where cluster.nodes or self.nodes is used - I did it, but could have missed some places as it's quite mundane work.

Let us know when you finish and it is ready for merge. Also rerun all the tests after the recent updates. Include 1-2 artifact tests if they don't break, as this is a showstopper for Scylla master and we don't want to affect it (they run on GCE, so it will verify whether something required is missing there). Run also one Azure test (can be an artifact test too).

soyacz avatar Oct 30 '24 10:10 soyacz

  • sdcm.cluster.BaseScyllaCluster.start_kms_key_rotation_thread should be adjusted to use data nodes I think

it is related to scylla config, so it is better to run on all nodes

but inside it verifies sstables of user keyspaces - this will fail. This part needs to be adjusted.

Yep, I see what you mean. Fixed to use data nodes.

to check:

  1. sdcm.cluster.BaseScyllaCluster.get_node - it is used for nosqlbench as a target and in other places; see if it shouldn't be pinned to data nodes

Set to use data_nodes

In that case, if drivers ignore zero nodes, this change is not required

reverted.

  1. sdcm.kafka.kafka_cluster.LocalKafkaCluster - doesn't it break the kafka connector?

it shouldn't.

Worth scheduling a test for it (it can be done later; raise a scylla issue if it fails)

Added it to the plan. Can you point me to the kafka job?

More notes: I didn't verify if you indeed fixed all my noted places - too hard, so make sure you pushed all changes. See where we use nodetool status and adjust the places that use it - I might have missed something. You can also double-check all the places where cluster.nodes or self.nodes is used - I did it, but could have missed some places as it's quite mundane work.

I've done that, but could also have missed something. Anyway, for now only several specific jobs will run with zero tokens enabled; by default it is disabled and most of the jobs should work as expected. I will run all jobs with zero tokens enabled one by one to minimize possible crashes.

Let us know when you finish and it is ready for merge. Also rerun all the tests after the recent updates. Include 1-2 artifact tests if they don't break, as this is a showstopper for Scylla master and we don't want to affect it (they run on GCE, so it will verify whether something required is missing there). Run also one Azure test (can be an artifact test too).

I will update job results here:

aleksbykov avatar Oct 30 '24 12:10 aleksbykov

  1. sdcm.kafka.kafka_cluster.LocalKafkaCluster - doesn't it break the kafka connector?

it shouldn't.

Worth scheduling a test for it (it can be done later; raise a scylla issue if it fails)

Added it to the plan. Can you point me to the kafka job?

https://jenkins.scylladb.com/view/master/job/scylla-master/job/kafka_connectors/ , but speak with @dimakr as new tests were added recently, and I think we are still missing jenkins jobs for them.

More notes: I didn't verify if you indeed fixed all my noted places - too hard, so make sure you pushed all changes. See where we use nodetool status and adjust the places that use it - I might have missed something. You can also double-check all the places where cluster.nodes or self.nodes is used - I did it, but could have missed some places as it's quite mundane work.

I've done that, but could also have missed something. Anyway, for now only several specific jobs will run with zero tokens enabled; by default it is disabled and most of the jobs should work as expected. I will run all jobs with zero tokens enabled one by one to minimize possible crashes.

Sure, just minimizing the number of problems - we have many nemeses and it's better to pass as many as possible in the first run.

soyacz avatar Oct 30 '24 13:10 soyacz

https://jenkins.scylladb.com/view/master/job/scylla-master/job/kafka_connectors/ , but speak with @dimakr as new tests were added recently, and I think we are still missing jenkins jobs for them.

Right, we have jobs for the kafka source connector (scylla > kafka direction of change replication). The job for the kafka sink connector (kafka > scylla direction) is not there yet, but the test itself can easily be triggered using the existing longevity jobs and a kafka-sink-connector-specific test config.

dimakr avatar Oct 30 '24 14:10 dimakr

All jobs were run with zero node support disabled but with the new node pools: .nodes = all nodes, .data_nodes = only data nodes, .zero_nodes = only zero nodes

and the new nemesis target_node_pool, which is explicitly set to use data_nodes as the target.
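A condensed sketch of those pools (the real BaseScyllaCluster code is more involved; only the property names and the BaseNode._zero_token_node flag come from this thread):

```python
class BaseScyllaCluster:
    @property
    def data_nodes(self):
        """Nodes that joined the token ring and hold user data."""
        return [node for node in self.nodes if not node._zero_token_node]

    @property
    def zero_nodes(self):
        """Raft/group0-only nodes bootstrapped with join_ring: false."""
        return [node for node in self.nodes if node._zero_token_node]
```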

Artifact jobs:

  • Passed - https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/artifacts-ubuntu-2204-test/7
  • Passed - https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/artifacts-azure-image-test/8
  • FAILED - https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/artifacts-amazon2-test/2 - failed with the same error as in master

Individual nemeses passed (from the PR point of view). Jobs are marked as failed, but the errors were found in teardown; no errors in code or nemeses related to the PR changes:

  • https://jenkins.scylladb.com/view/master/job/scylla-master/job/abykov-nemesis-individuals/job/longevity-5gb-1h-AbortRepairMonkey-aws-test/
  • https://jenkins.scylladb.com/view/master/job/scylla-master/job/abykov-nemesis-individuals/job/longevity-5gb-1h-AddDropColumnMonkey-aws-test/
  • https://jenkins.scylladb.com/view/master/job/scylla-master/job/abykov-nemesis-individuals/job/longevity-5gb-1h-AddRemoveDcNemesis-aws-test/
  • https://jenkins.scylladb.com/view/master/job/scylla-master/job/abykov-nemesis-individuals/job/longevity-5gb-1h-AddRemoveMvNemesis-aws-test/
  • https://jenkins.scylladb.com/view/master/job/scylla-master/job/abykov-nemesis-individuals/job/longevity-5gb-1h-ClusterRollingRestart-aws-test/

Regular jobs:

  • https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-3h-azure-test/5 (failed by spot termination) - nemeses passed, no code errors
  • https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-3h-10gb-gce-test/6 (failed with c-s errors on teardown) - nemeses passed, no code errors
  • https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-50gb-3days-test/5 (failed, but not related to the PR changes)

Jobs with zero node support enabled and nemeses which run with zero nodes:

  • https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-multi-dc-rack-aware-zero-token-dc/37 - job passed; found only log errors and c-s errors

  • https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-4h-100gb-zero-nodes-larger-than-data-nodes/2 - Passed

  • https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-100gb-4h-zero-token-nodes/87/

aleksbykov avatar Oct 31 '24 09:10 aleksbykov

@soyacz can you take a look

aleksbykov avatar Oct 31 '24 09:10 aleksbykov

@soyacz , I removed the workaround. I think the PR is ready to be merged, and it has a chance to run over the weekend in all jobs (if something breaks, it can be reverted on Monday).

aleksbykov avatar Nov 01 '24 10:11 aleksbykov