fix(disrupt_terminate_and_replace_node): raise critical event on failure

Open cezarmoise opened this issue 8 months ago • 3 comments

If the nemesis cannot restore the cluster to the topological state it was in before the nemesis ran, it should raise a critical event so the test can be stopped.

Add a new event for topology failures, TopologyFailureEvent.
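A minimal sketch of the intended behavior, not the actual SCT implementation. The `Severity` enum, the dataclass fields, the plain `events` list and the `terminate_and_replace` helper are all hypothetical stand-ins; only the name TopologyFailureEvent comes from this PR.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels, for illustration only.
    ERROR = "ERROR"
    CRITICAL = "CRITICAL"


@dataclass
class TopologyFailureEvent:
    # Hypothetical shape; the real event class may carry different fields.
    nemesis: str
    message: str
    severity: Severity = Severity.CRITICAL


def terminate_and_replace(nodes, target, add_node, events):
    """Remove `target`, add a replacement, and record a critical event
    if the cluster does not return to its original size."""
    expected_size = len(nodes)
    nodes.remove(target)
    try:
        nodes.append(add_node())
    finally:
        # Runs on success and on failure; the original exception,
        # if any, still propagates after the event is recorded.
        if len(nodes) != expected_size:
            events.append(
                TopologyFailureEvent(
                    nemesis="disrupt_terminate_and_replace_node",
                    message=f"cluster has {len(nodes)} nodes, expected {expected_size}",
                )
            )
```

In this sketch the caller would treat any recorded CRITICAL event as a signal to stop the test, so later topology nemeses never run against a cluster that is one node short.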

refs: #9918

Testing

  • [ ]

PR pre-checks (self review)

  • [x] I added the relevant backport labels
  • [x] I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

cezarmoise avatar Mar 13 '25 14:03 cezarmoise

Why is it needed?

roydahan avatar Mar 13 '25 15:03 roydahan

> Why is it needed?

If the test continues after such a failure, it causes problems for other nemeses that affect topology, such as GrowShrinkCluster.

cezarmoise avatar Mar 13 '25 19:03 cezarmoise

Another option is to refactor this nemesis so that it always reaches the step that adds the new node, regardless of what failed earlier.

Either way, we can't accept a nemesis that removes a node and does not add a new one.
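A rough sketch of that refactoring option, assuming the nemesis can be split into a removal step, some intermediate steps and an add step. The function and parameter names here are hypothetical and do not correspond to the real nemesis code.

```python
def replace_node_always_add(nodes, target, intermediate_step, add_node):
    """Remove `target`, run the intermediate work, and always attempt
    to add a replacement node, even if the intermediate work fails."""
    nodes.remove(target)
    try:
        # Anything that can fail between removal and addition goes here.
        intermediate_step()
    finally:
        # Executed whether or not intermediate_step() raised, so the
        # cluster gets its replacement before the error propagates.
        nodes.append(add_node())
```

The original failure is still raised to the caller after the node is added, so the nemesis is reported as failed while the cluster size is preserved. If adding the node itself fails, that exception propagates instead, and the cluster is left short of a node.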

fruch avatar Mar 13 '25 21:03 fruch

@cezarmoise what is the future of this PR? Do you plan to continue working on it, or can it be closed?

pehala avatar May 09 '25 08:05 pehala