zeebe-chaos
Hypothesis: High CPU consumption on one Node will not break cluster
Hypothesis
We believe that high CPU consumption on one node will not affect the cluster as a whole. It might cause a leader election, but it should not break the cluster.
Chaos experiment: We wanted to decrease the blast radius by using only one partition, but we found a bug where this seemed not to be possible: https://github.com/zeebe-io/zeebe/issues/4664
We ran the experiment with 2 partitions and put really high CPU load on the leader (from inside the pod). We expected that this would not impact the complete cluster, and that at most we would see a leader change because the leader is not able to send heartbeats in time. After removing the CPU load we should be back at our throughput baseline, where we start and complete around 70 - 80 workflow instances per second.
The results were quite promising: we had no leader change at all. The leader was able to send heartbeats in time, and the backpressure did a good job and dropped more requests. After reducing the CPU load we went back to our steady state.

We tested it manually: we opened multiple bash sessions in the pod (k exec -it zeebe-chaos-zeebe-2 bash) and then ran dd if=/dev/zero of=/dev/null inside each one.
I used this script: https://github.com/zeebe-io/zeebe-chaos/blob/tt-chaos/chaos-experiments/kubernetes/scripts/stress-cpu-broker.sh. It works pretty well, but you have to install the command in the pod.
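For comparison, the manual dd approach can be wrapped in a few lines that need nothing installed in the pod beyond bash, dd and sleep. This is only a rough sketch and not the linked script: it assumes k is an alias for kubectl, and the names POD, WORKERS, DURATION, DRY_RUN and build_stress_cmd are made up for illustration. It prints the command by default and only executes it when DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: start several dd busy loops inside one broker pod for a fixed time.
# POD, WORKERS, DURATION and DRY_RUN are illustrative names (assumptions).
set -euo pipefail

POD="${POD:-zeebe-chaos-zeebe-2}"   # pod name taken from the manual test
WORKERS="${WORKERS:-4}"             # number of parallel dd loops (assumed value)
DURATION="${DURATION:-60}"          # seconds to keep the load up (assumed value)

# Build the command line that runs inside the pod: start the dd loops in the
# background, wait, then kill them so the load goes away again.
build_stress_cmd() {
  local workers="$1" duration="$2"
  printf 'for ((i=0; i<%s; i++)); do dd if=/dev/zero of=/dev/null & done; sleep %s; kill $(jobs -p)' \
    "$workers" "$duration"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: only show what would be executed.
  echo "kubectl exec $POD -- bash -c '$(build_stress_cmd "$WORKERS" "$DURATION")'"
else
  kubectl exec "$POD" -- bash -c "$(build_stress_cmd "$WORKERS" "$DURATION")"
fi
```

Each dd loop keeps roughly one core busy, so WORKERS should be chosen relative to the CPU limit of the pod.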
With https://github.com/zeebe-io/zeebe-chaos/commit/bf4ee9fe4120720a28dec9e18036a0ec633f440d#diff-ad7069ad18a5117c98319d865ba9dca0 we added an automated experiment where we stress the CPU and then try to start instances afterwards. The next step would be to stress the CPU during the experiment while trying to start instances, and to stop the CPU stress on rollback.
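The ordering of that next step could look roughly like the sketch below. It is not the committed experiment: start_stress, stop_stress and start_instances are hypothetical placeholders that only record that they were called, so the control flow can be tried out without a cluster. The point is that the stress is always stopped, even when starting instances fails.

```shell
#!/usr/bin/env bash
# Sketch of the proposed flow: keep the CPU stress running while instances
# are started, and always stop the stress afterwards (the "rollback").
# All three step functions are hypothetical placeholders (assumptions).
set -euo pipefail

LOG=""

start_stress()    { LOG="${LOG}start_stress;"; }     # would start the CPU load in the pod
stop_stress()     { LOG="${LOG}stop_stress;"; }      # would remove the CPU load again
start_instances() { LOG="${LOG}start_instances;"; }  # would create workflow instances

run_experiment() {
  local status=0
  start_stress
  # Run the check under load, but remember its result instead of aborting,
  # so that the rollback below still runs when the check fails.
  if start_instances; then status=0; else status=$?; fi
  stop_stress
  return "$status"
}

run_experiment
echo "$LOG"
```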