Hypothesis: High CPU consumption on one Node will not break cluster

Open ChrisKujawa opened this issue 5 years ago • 4 comments

Hypothesis

We believe that high CPU consumption on one node will not affect the cluster. It might cause a leader election, but it should not break the cluster.

ChrisKujawa avatar Mar 04 '20 17:03 ChrisKujawa

Chaos experiment: We wanted to decrease the blast radius by using only one partition, but we found a bug that seemed to make this impossible: https://github.com/zeebe-io/zeebe/issues/4664

We ran the experiment with 2 partitions and put a very high CPU load on the leader (from inside the pod). We expected that this would not impact the whole cluster, and that at most we would see a leader change because the leader is not able to send heartbeats in time. After removing the CPU load we should be back at our throughput baseline, where we start and complete around 70 - 80 workflow instances per second.

The results were quite promising: we had no leader change at all. The leader was able to send heartbeats in time, and the backpressure did a good job and dropped more requests. After reducing the CPU load we went back to our steady state.

[Image: result-chaos]

ChrisKujawa avatar Jun 04 '20 14:06 ChrisKujawa

We tested it manually by opening multiple bash sessions in the pod (k exec -it zeebe-chaos-zeebe-2 bash) and then running dd if=/dev/zero of=/dev/null inside each of them.
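
The manual steps above could be wrapped into two small helpers so the load is easy to start and to clean up again. This is only a sketch under assumptions: the function names start_cpu_load and stop_cpu_load and the default of 2 workers are made up for illustration and are not part of the zeebe-chaos scripts. It is meant to be sourced in a shell inside the pod.

```shell
# Sketch only: helper names and the default worker count are illustrative
# assumptions, not taken from the zeebe-chaos repository.

# Space-separated list of the dd worker PIDs started by this shell.
LOAD_PIDS=""

# Start N background dd workers (default 2). Each one copies zeros to
# /dev/null until it is killed, which keeps one core busy.
start_cpu_load() {
  workers="${1:-2}"
  i=0
  while [ "$i" -lt "$workers" ]; do
    dd if=/dev/zero of=/dev/null 2>/dev/null &
    LOAD_PIDS="$LOAD_PIDS $!"
    i=$((i + 1))
  done
}

# Kill and reap every worker started by start_cpu_load.
stop_cpu_load() {
  for pid in $LOAD_PIDS; do
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
  done
  LOAD_PIDS=""
}
```

A possible usage would be start_cpu_load 4, then observing the cluster for a while, then stop_cpu_load. How many workers are needed to saturate the pod depends on its CPU limits.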

ChrisKujawa avatar Jun 05 '20 07:06 ChrisKujawa

I used this script, which works pretty well, but you have to install the command in the pod: https://github.com/zeebe-io/zeebe-chaos/blob/tt-chaos/chaos-experiments/kubernetes/scripts/stress-cpu-broker.sh

deepthidevaki avatar Jun 05 '20 09:06 deepthidevaki

With https://github.com/zeebe-io/zeebe-chaos/commit/bf4ee9fe4120720a28dec9e18036a0ec633f440d#diff-ad7069ad18a5117c98319d865ba9dca0 we added an automated experiment that applies the CPU stress and then tries to start instances afterwards. The next step would be to stress the CPU during the experiment while trying to start instances, and to stop the CPU stress on rollback.
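
The proposed next step (stress during the experiment, probe, always roll back) might be structured roughly like the sketch below. It is not the code from the linked commit. run_experiment and its three hook arguments are hypothetical names, and the hooks themselves (starting the stress, trying to start instances, stopping the stress) are left to the caller.

```shell
# Sketch only: run_experiment and its hook arguments are hypothetical and
# do not come from the zeebe-chaos repository.
#
# run_experiment ACTION PROBE ROLLBACK
#   ACTION   starts the CPU stress
#   PROBE    checks the cluster while the stress is active,
#            for example by trying to start instances
#   ROLLBACK stops the CPU stress; it runs whether or not PROBE succeeds
run_experiment() {
  action="$1"
  probe="$2"
  rollback="$3"

  # If the stress cannot be started, clean up and report failure.
  if ! "$action"; then
    "$rollback"
    return 1
  fi

  # Remember the probe result, but do not return before the rollback.
  status=0
  "$probe" || status=$?

  "$rollback"
  return "$status"
}
```

A call could look like run_experiment start_stress can_start_instances stop_stress, where all three are functions or scripts you provide. The return code is that of the probe, so a failed probe still fails the experiment after the stress has been removed.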

ChrisKujawa avatar Jun 18 '20 11:06 ChrisKujawa