scylla-cluster-tests

`GrowShrinkClusterNemesis` fails when the current number of nodes is less than the initial number of nodes

Open cezarmoise opened this issue 10 months ago • 7 comments

`_grow_cluster` always adds a fixed number of nodes (`nemesis_add_node_cnt`), but `_shrink_cluster` removes a relative number of nodes: `decommission_nodes_number = min(cur_num_nodes_in_dc - initial_db_size, add_nodes_number)`

In this example, https://argus.scylladb.com/tests/scylla-cluster-tests/0b0e042d-60a7-4dad-832d-4e38f2e5a5e9, when the nemesis ran, there were 4 nodes running, but the initial number of nodes was 5. So grow went from 4 to 5, but shrink decided there was nothing to remove, since 5-5=0, and threw an error.
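The arithmetic behind the failure can be replayed with a small sketch. The function names here are hypothetical stand-ins, not the actual code in `sdcm/nemesis.py`; only the `min(...)` expression comes from the description above, and `add_nodes_number = 1` is an assumption inferred from the 4 to 5 growth in this run.

```python
def nodes_after_grow(cur_num_nodes: int, add_nodes_number: int) -> int:
    """Grow always adds a fixed number of nodes (hypothetical helper)."""
    return cur_num_nodes + add_nodes_number


def decommission_nodes_number(
    cur_num_nodes_in_dc: int, initial_db_size: int, add_nodes_number: int
) -> int:
    """Shrink removes a number of nodes relative to the initial size."""
    return min(cur_num_nodes_in_dc - initial_db_size, add_nodes_number)


initial_db_size = 5   # configured initial cluster size
add_nodes_number = 1  # assumed value of nemesis_add_node_cnt
cur_num_nodes = 4     # one node was lost earlier in the test

grown = nodes_after_grow(cur_num_nodes, add_nodes_number)
to_remove = decommission_nodes_number(grown, initial_db_size, add_nodes_number)

print(grown, to_remove)  # 5 0 -> shrink has nothing to remove
```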

A solution could be to make grow add more nodes if the current number of nodes is less than the initial number of nodes.
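A rough sketch of that idea, assuming grow tops up the missing nodes first and then adds the usual amount. The helper names are hypothetical and this is not the current implementation.

```python
def nodes_to_add(cur_num_nodes: int, initial_db_size: int, add_nodes_number: int) -> int:
    """Add the configured count plus whatever is missing versus the initial size."""
    deficit = max(0, initial_db_size - cur_num_nodes)
    return add_nodes_number + deficit


def nodes_to_remove(cur_num_nodes: int, initial_db_size: int, add_nodes_number: int) -> int:
    """Shrink formula as described in this issue, unchanged."""
    return min(cur_num_nodes - initial_db_size, add_nodes_number)


initial_db_size, add_nodes_number = 5, 1  # assumed values for the run above
cur = 4

cur += nodes_to_add(cur, initial_db_size, add_nodes_number)        # 4 + 2 = 6
removed = nodes_to_remove(cur, initial_db_size, add_nodes_number)  # min(1, 1) = 1
cur -= removed                                                     # back to 5
```

With these numbers shrink has one node to remove, and the cluster ends at the initial size of 5 rather than the 4 nodes it had when the nemesis started.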

cezarmoise avatar Jan 27 '25 13:01 cezarmoise

those happen because another nemesis that changes topology failed

2025-01-24 08:14:41.747: (DisruptionEvent Severity.ERROR) period_type=end event_id=52869c52-b0e4-4b65-b6d0-e29c8f4e7ae0 duration=8m17s: nemesis_name=TerminateAndReplaceNode target_node=Node longevity-parallel-topology-schema--db-node-0b0e042d-6 [54.83.0.13 | 10.12.10.16] errors=Removed node state should be DN
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5309, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1583, in disrupt_terminate_and_replace_node
    self._terminate_and_replace_node()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1602, in _terminate_and_replace_node
    assert get_node_state(old_node_ip) == "DN", "Removed node state should be DN"
AssertionError: Removed node state should be DN

the bug isn't with grow-shrink, but with that one. a nemesis should strive to restore the state as it was before (topology-wise), and if it can't, the test should be stopped.

a nemesis shouldn't compensate like that for other nemeses

fruch avatar Feb 24 '25 08:02 fruch

So the test should fail on that assert, instead of continuing?

cezarmoise avatar Feb 24 '25 14:02 cezarmoise

> So the test should fail on that assert, instead of continuing?

it should try to recover the cluster state to the initial one; only if that's not possible, fail the whole test.

soyacz avatar Feb 24 '25 16:02 soyacz

`disrupt_terminate_and_replace_node` needs to be fixed to raise a critical error in any case where it can't do the replacement
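One way to picture this is a guard that compares the node count before and after a topology-changing nemesis. Everything in this sketch is hypothetical (`CriticalTopologyError`, `run_with_topology_check`, the `node_count` callable); it does not use SCT's real event or severity machinery.

```python
from typing import Callable


class CriticalTopologyError(Exception):
    """Hypothetical error meaning: stop the test, topology was not restored."""


def run_with_topology_check(node_count: Callable[[], int], disrupt: Callable[[], None]) -> None:
    """Run a nemesis and escalate if it leaves the cluster with a different size."""
    before = node_count()
    try:
        disrupt()
    except Exception as exc:
        # The nemesis failed; escalate only if it also changed the topology.
        if node_count() != before:
            raise CriticalTopologyError(
                f"node count changed from {before} to {node_count()}"
            ) from exc
        raise
    if node_count() != before:
        raise CriticalTopologyError(
            f"node count changed from {before} to {node_count()}"
        )
```

A failure that leaves the node count untouched is re-raised as-is, so only the "terminated but not replaced" case stops the test.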

fruch avatar Mar 13 '25 11:03 fruch

I don't understand what the issue was here. The terminate node nemesis couldn't remove the node?

It shouldn't raise a critical.

roydahan avatar Mar 13 '25 15:03 roydahan

> I don't understand what the issue was here. The terminate node nemesis couldn't remove the node?
>
> It shouldn't raise a critical.

If a nemesis terminates a node but doesn't add one back, it should stop the test

Also, the logic of the grow-shrink nemesis can be adapted to work based on the current situation, and not based on the numbers from the configuration (because it might leave the cluster in a non-working state)
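A minimal sketch of what "based on the current situation" could look like: remember the node count before growing and shrink back to exactly that. `FakeCluster` and `grow_then_shrink` are hypothetical names used only for illustration, not SCT classes.

```python
from dataclasses import dataclass


@dataclass
class FakeCluster:
    """Toy stand-in for a cluster; only tracks the node count."""
    nodes: int

    def add_nodes(self, count: int) -> None:
        self.nodes += count

    def decommission_nodes(self, count: int) -> None:
        if count <= 0:
            raise ValueError("nothing to decommission")
        self.nodes -= count


def grow_then_shrink(cluster: FakeCluster, add_nodes_number: int) -> int:
    """Grow by a fixed amount, then remove whatever was added in this run."""
    before = cluster.nodes
    cluster.add_nodes(add_nodes_number)
    to_remove = cluster.nodes - before  # relative to the current state, not the config
    cluster.decommission_nodes(to_remove)
    return to_remove


cluster = FakeCluster(nodes=4)  # already one node short of the configured 5
removed = grow_then_shrink(cluster, add_nodes_number=1)
```

Here the nemesis leaves the cluster at the size it found it, so an earlier topology failure is neither masked nor made worse.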

fruch avatar Mar 13 '25 21:03 fruch

ok, now I understand. It's fine to fail it as critical.

roydahan avatar Mar 17 '25 13:03 roydahan