scylla-cluster-tests
`GrowShrinkClusterNemesis` fails when the current number of nodes is less than the initial number of nodes
`_grow_cluster` always adds a fixed number of nodes (`nemesis_add_node_cnt`), but `_shrink_cluster` removes a relative number of nodes: `decommission_nodes_number = min(cur_num_nodes_in_dc - initial_db_size, add_nodes_number)`.
In this example, https://argus.scylladb.com/tests/scylla-cluster-tests/0b0e042d-60a7-4dad-832d-4e38f2e5a5e9, when the nemesis ran there were 4 nodes running, but the initial number of nodes was 5.
So grow went from 4 to 5, but shrink decided there was nothing to remove, since 5 - 5 = 0, and threw an error.
A solution could be to make grow add more nodes whenever the current number of nodes is less than the initial number of nodes, as sketched below.
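To make the arithmetic concrete, here is a minimal sketch (not the actual SCT code) of the shrink formula quoted above and the proposed grow compensation; the variable names follow the issue, but the helper functions themselves are hypothetical.

```python
def shrink_count(cur_num_nodes_in_dc: int, initial_db_size: int, add_nodes_number: int) -> int:
    # Current shrink logic quoted in the issue.
    return min(cur_num_nodes_in_dc - initial_db_size, add_nodes_number)

def grow_count(cur_num_nodes_in_dc: int, initial_db_size: int, add_nodes_number: int) -> int:
    # Proposed fix: add extra nodes when the cluster is below its initial size,
    # so the subsequent shrink has something to remove.
    missing = max(initial_db_size - cur_num_nodes_in_dc, 0)
    return add_nodes_number + missing

# The scenario from the linked run: 4 nodes running, 5 expected initially, step of 1.
cur, initial, add = 4, 5, 1
print(shrink_count(cur + add, initial, add))  # 0 -> nothing to decommission, nemesis errors
print(grow_count(cur, initial, add))          # 2 -> grows 4 -> 6, so shrink can remove 1
```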
These situations happen because another nemesis that changes topology failed:
```
2025-01-24 08:14:41.747: (DisruptionEvent Severity.ERROR) period_type=end event_id=52869c52-b0e4-4b65-b6d0-e29c8f4e7ae0 duration=8m17s: nemesis_name=TerminateAndReplaceNode target_node=Node longevity-parallel-topology-schema--db-node-0b0e042d-6 [54.83.0.13 | 10.12.10.16] errors=Removed node state should be DN
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5309, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1583, in disrupt_terminate_and_replace_node
    self._terminate_and_replace_node()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1602, in _terminate_and_replace_node
    assert get_node_state(old_node_ip) == "DN", "Removed node state should be DN"
AssertionError: Removed node state should be DN
```
The bug isn't with grow-shrink, but with that nemesis. A nemesis should strive to put the state back as it was before (topology-wise), and if it can't, the test should be stopped.
A nemesis shouldn't compensate like that for another nemesis.
So the test should fail on that assert, instead of continuing?
It should try to recover the cluster state to the initial one; only if that's not possible should the whole test fail.
`disrupt_terminate_and_replace_node` needs to be fixed to raise a critical error in any case where it can't do the replacement.
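A hedged sketch of what that could look like, assuming a hypothetical `CriticalNemesisError` and a `get_node_state` callable; the real SCT error-reporting machinery may differ:

```python
class CriticalNemesisError(Exception):
    """Illustrative error type: raised when a nemesis cannot restore the cluster topology."""

def _terminate_and_replace_node(old_node_ip: str, get_node_state) -> None:
    # Instead of a plain AssertionError that lets the run continue with a
    # smaller cluster, fail hard when the replacement can't be completed.
    state = get_node_state(old_node_ip)
    if state != "DN":
        raise CriticalNemesisError(
            f"Removed node {old_node_ip} state should be DN, got {state!r}; "
            "cannot complete the replacement"
        )
```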
I don't understand what the issue was here. The terminate nemesis couldn't remove the node?
It shouldn't raise a critical error.
If a nemesis terminates a node but doesn't add one back, it should stop the test.
Also, the logic of the grow-shrink nemesis can be adapted to work based on the current situation rather than on the numbers from the configuration (because otherwise it might leave the cluster in a non-working state), as sketched below.
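For illustration, a minimal sketch of sizing the step from the nodes actually running when the nemesis starts, rather than from the configured initial size (`plan_grow_shrink` and its parameters are illustrative names, not SCT code):

```python
def plan_grow_shrink(running_nodes: int, add_nodes_number: int) -> tuple[int, int]:
    # Grow by the configured step, then shrink exactly what was added, so the
    # nemesis returns the cluster to `running_nodes` regardless of how earlier
    # topology nemeses may have left it.
    grow_by = add_nodes_number
    shrink_by = grow_by
    return grow_by, shrink_by

grow, shrink = plan_grow_shrink(running_nodes=4, add_nodes_number=1)
# 4 -> 5 -> 4: the nemesis is self-contained and never computes a shrink of 0
# just because the cluster started below its configured initial size.
```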
ok, now I understand. It's fine to fail it as critical.