redpanda-operator Orchestrate scaling down Redpanda resource


If a user scales down by more than (N/2 + 1), where N is the replication factor, Redpanda will lose Raft quorum and will be unable to serve any decommission request. The operator should handle this gracefully by scaling down using the (N/2 + 1) formula and waiting for the old nodes to be fully decommissioned.
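One way to read "scale down in stages and wait for decommission" is sketched below. This is an illustration only, not the operator's implementation: `PlanScaleDown` is a hypothetical helper, and the sketch assumes the limit is computed from the current broker count, so that each stage removes at most floor((n-1)/2) brokers and a majority of the brokers present at the start of the stage is still there at the end. It also assumes a minimum stage size of one broker so that very small clusters can still shrink.

```go
package main

import "fmt"

// PlanScaleDown is a hypothetical helper. It returns the intermediate replica
// counts to step through when shrinking from current to desired brokers.
// Each stage removes at most floor((n-1)/2) brokers, where n is the broker
// count at the start of the stage, and at least one broker (an assumption made
// so that clusters of one or two brokers still make progress).
// The caller is expected to wait for decommissioning to finish between stages.
func PlanScaleDown(current, desired int) []int {
	var stages []int
	for current > desired {
		step := (current - 1) / 2
		if step < 1 {
			step = 1
		}
		// Do not overshoot the requested size.
		if current-step < desired {
			step = current - desired
		}
		current -= step
		stages = append(stages, current)
	}
	return stages
}

func main() {
	// Shrinking from 7 to 3 brokers takes two stages under these assumptions.
	fmt.Println(PlanScaleDown(7, 3))
}
```

If the requested size is already at or above the current size, the returned plan is empty and nothing needs to be decommissioned.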

JIRA Link: K8S-197

Mar 12 '24 14:03 RafalKorepta

A few comments:

Do we need to store the original size of the cluster somewhere? Cluster info can show which nodes are down, and it can be used to keep track of scaling down both the replicas and the number of nodes needed for a quorum, right? I think this is where Redpanda is a bit special, since you can decommission nodes, effectively changing this formula. So what do we mean by scaling down here? Do we mean literally just scaling, without necessarily changing the number of commissioned nodes?
Are we OK with using validatingWebhooks? This would still require knowing how many nodes are commissioned.

Mar 21 '24 13:03 alejandroEsc

Scaling here would be changing the number of active brokers in the redpanda cluster.

IMO we shouldn't need to keep track of the original size. We can always measure the number of active brokers with RPK or kubectl queries. Once the Spec has been updated, the operator should only focus on reconciling that. Rollbacks would need a lot more work.

Are we OK with using validatingWebhooks? This would still require knowing how many nodes are commissioned.

I'm a bit split on this one. I like not duplicating logic when possible. We could instead rely on Redpanda to not decommission nodes that it can't, and push the responsibility back onto the users?

Mar 21 '24 14:03 chrisseto

One more thing: quorum is lost if there are fewer than (N+1)/2 nodes, and we can tolerate at most (N-1)/2 failures. This means that if we have (N+1)/2 failures, we have lost quorum and can no longer read or write.

That said, we lose quorum if we replicate down to (N+1)/2 -1 = (N-1)/2 nodes. This is what I will be using.

Mar 25 '24 13:03 alejandroEsc

NIT

That said, we lose quorum if we replicate down to (N+1)/2 -1 = (N-1)/2 nodes.

We can afford to lose floor((N-1)/2) nodes.

Other than that I agree.
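The arithmetic discussed above can be written out as a small sketch. The function names are hypothetical and are not taken from the operator code base. Note that (N+1)/2 and floor(N/2) + 1 agree for odd N, while for even N only the floor form gives a strict majority, which is why the floor matters.

```go
package main

import "fmt"

// QuorumSize returns the smallest number of voters that forms a strict
// majority among n voters: floor(n/2) + 1.
func QuorumSize(n int) int {
	if n <= 0 {
		return 0
	}
	return n/2 + 1
}

// MaxTolerableFailures returns how many of n voters can be lost while a
// strict majority still remains: floor((n-1)/2).
func MaxTolerableFailures(n int) int {
	if n <= 0 {
		return 0
	}
	return (n - 1) / 2
}

func main() {
	for _, n := range []int{1, 3, 4, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerable failures=%d\n",
			n, QuorumSize(n), MaxTolerableFailures(n))
	}
}
```

For every positive n the two values add up to n, so losing one more node than `MaxTolerableFailures(n)` leaves fewer nodes than `QuorumSize(n)`.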

Mar 25 '24 13:03 RafalKorepta

The scope has changed a bit: we now also want to maintain quorum for topic partitions.

Mar 27 '24 13:03 alejandroEsc

Addressed in https://github.com/redpanda-data/redpanda-operator/pull/102

Mar 28 '24 16:03 alejandroEsc

For this ticket, what we will do for now is add the quorum validation check we have in the current PR. I will create a new ticket to discuss the issue we should really be fixing, which is scaling down in a controlled fashion. We should probably do this once we move away from Flux.

Mar 28 '24 16:03 alejandroEsc

After some testing, I am checking whether there is a quick and simple win in preventing a scale-down below the minimum replication factor.
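A check of this kind could look roughly like the sketch below. It is an assumption about the shape of such a validation, not the code from the PR: `ValidateReplicas` and its parameters are hypothetical, and how the minimum replication factor is obtained (cluster configuration, topic metadata, or elsewhere) is left open.

```go
package main

import "fmt"

// ValidateReplicas is a hypothetical validation helper. It rejects a requested
// broker count that is lower than the minimum replication factor, because
// partitions could then no longer be placed on enough distinct brokers.
func ValidateReplicas(desiredReplicas, minReplicationFactor int) error {
	if desiredReplicas < minReplicationFactor {
		return fmt.Errorf(
			"requested replicas (%d) is below the minimum replication factor (%d)",
			desiredReplicas, minReplicationFactor)
	}
	return nil
}

func main() {
	// Scaling to 2 brokers with a minimum replication factor of 3 is rejected.
	if err := ValidateReplicas(2, 3); err != nil {
		fmt.Println("rejected:", err)
	}
	// Scaling to exactly the minimum replication factor is allowed.
	if err := ValidateReplicas(3, 3); err == nil {
		fmt.Println("accepted")
	}
}
```

Returning an error rather than silently clamping the value keeps the decision with the user, which matches the earlier suggestion in this thread of pushing responsibility back onto users.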

Apr 01 '24 20:04 alejandroEsc
