Operator: Scaling up a cluster triggers rolling restart
Version & Environment
Redpanda version (rpk version): v22.3.1-rc4
What went wrong?
Increasing the cluster replicas triggers a rolling restart, which means that new Redpanda pods get scheduled on existing pods' nodes. For example, in an N-node cluster scaled up to M nodes:
- Operator triggers a rolling restart
- Pod 0 is deleted (restarted)
- Pod N+1 is scheduled on Node 0
- Pod 0, which has an affinity for Node 0, becomes unschedulable
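The affinity mentioned above typically comes from pod 0's node-local PersistentVolume: the volume only exists on Node 0, so the pod can only be rescheduled there. A sketch of the kind of PV involved (all names and sizes are illustrative, not taken from this cluster):

```yaml
# Hypothetical local PersistentVolume backing pod 0's data directory.
# The nodeAffinity block pins any pod using this volume to node-0;
# once a new pod occupies node-0, pod 0 has nowhere to schedule.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-0            # illustrative name
spec:
  capacity:
    storage: 100Gi                 # illustrative size
  accessModes:
    - ReadWriteOnce
  local:
    path: /var/lib/redpanda/data   # illustrative path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-0           # the only node this volume exists on
```

Because the volume is `ReadWriteOnce` and node-local, the scheduler reports a volume node-affinity conflict rather than placing pod 0 elsewhere.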
What should have happened instead?
Existing pods shouldn't be restarted, and new pods should be scheduled on available nodes.
How to reproduce the issue?
- Deploy an N-broker Redpanda cluster on an M-node k8s cluster using the operator.
- Edit the cluster CR, increasing the replicas from N to M
- Monitor the pods in the redpanda namespace (kubectl get pods -n redpanda -w)
- Watch a rolling restart be attempted, with a new pod being scheduled on node 0, and then pod 0 becoming unschedulable due to a persistent volume conflict.
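The CR edit in the steps above might look like the following excerpt (the cluster name is taken from the logs below; the apiVersion and replica counts are assumptions for illustration):

```yaml
# Hypothetical Cluster CR excerpt; only the replicas field changes.
apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: rp-juan-1111
  namespace: redpanda
spec:
  replicas: 4   # raised from N to M — this edit alone triggers the rolling restart
```

No other spec field needs to change for the restart to be triggered.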
Additional information
Deleting the pod scheduled on pod 0's node allows the rolling restart to continue, but of course a pod inevitably becomes unschedulable in the end:
redpanda@ip-172-16-1-162:~$ kubectl get po -n redpanda -w
NAME READY STATUS RESTARTS AGE
rp-juan-1111-0 0/1 Pending 0 22m
rp-juan-1111-1 1/1 Running 0 93m
rp-juan-1111-2 1/1 Running 0 93m
rp-juan-1111-3 1/1 Terminating 0 79m
sasl-user-creation-first-superuser-pvwv7 0/1 Completed 0 93m
rp-juan-1111-3 0/1 Terminating 0 79m
rp-juan-1111-3 0/1 Terminating 0 79m
rp-juan-1111-3 0/1 Terminating 0 79m
rp-juan-1111-3 0/1 Pending 0 0s
rp-juan-1111-3 0/1 Pending 0 0s
rp-juan-1111-0 0/1 Pending 0 22m
rp-juan-1111-0 0/1 Init:0/1 0 22m
rp-juan-1111-0 0/1 PodInitializing 0 22m
There's a beginning of a fix here: https://github.com/redpanda-data/redpanda/pull/4964
Update: this no longer happens.