Operator scale-up loop with recreate strategy
Describe the bug
During a deployment with the update strategy set to recreate in a 28-node cluster, the operator began increasing the number of replicas on the old StatefulSet instead of deleting it.
The update had a dual purpose: scaling up and changing a few settings.
We first scaled up (from 9:00:58 to 9:01:43), then changed the configuration around 9:05, which triggered the creation of a new StatefulSet (the old one being emqx-core-7c6fbb448d and the new one emqx-core-778c694444).
However, once the new StatefulSet was ready and its nodes had joined the cluster, the operator started increasing the replicas of the old StatefulSet (API server audit log events are attached as logs-insights-results.csv); kubectl get statefulset emqx-core-7c6fbb448d showed 0 ready replicas while the requested replicas kept increasing.
As a workaround, we scaled the operator down to 0 and then scaled the old StatefulSet down to 0.
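Roughly, the workaround was the following (the operator deployment name and namespaces are assumptions based on a default Helm install and may differ in your environment):

```shell
# Stop the operator so it cannot keep reconciling (names/namespaces assumed).
kubectl -n emqx-operator-system scale deployment emqx-operator-controller-manager --replicas=0
# Then scale the old StatefulSet down to 0.
kubectl -n <emqx-namespace> scale statefulset emqx-core-7c6fbb448d --replicas=0
```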
To Reproduce
We did not manage to reproduce it in non-production environments.
Expected behavior
Instead of increasing the number of instances, the operator should delete the old StatefulSet.
Anything else we need to know?:
Looking at the code, we think the way the operator calculates the number of replicas during scale-down is wrong. Here is the part where it calculates the number of replicas: https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/sync_pods.go#L95. That number comes from the status instead of the replicas field (https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/update_emqx_status.go#L82-L102), and the number of running replicas ends up being the sum of the two StatefulSets. This turns the scale-down into a scale-up.
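To illustrate the suspected failure mode (a hypothetical sketch with made-up names and numbers, not the operator's actual code): if the baseline used for the scale-down step counts pods of both StatefulSets, the computed target for the old StatefulSet is higher than its own spec.replicas, so every reconcile pushes it further up.

```go
package main

import "fmt"

// nextOldReplicas is the intended "shrink the old StatefulSet by step" rule.
func nextOldReplicas(currentReplicas, step int32) int32 {
	return currentReplicas - step
}

func main() {
	oldSpecReplicas := int32(9)  // what the old StatefulSet itself requests
	newReadyReplicas := int32(9) // pods already running under the new StatefulSet

	// Buggy baseline: a status figure that aggregates both StatefulSets.
	aggregated := oldSpecReplicas + newReadyReplicas
	fmt.Println("target from aggregated status:", nextOldReplicas(aggregated, 1)) // 17 -> scale-up

	// Correct baseline: the old StatefulSet's own spec.replicas.
	fmt.Println("target from spec.replicas:", nextOldReplicas(oldSpecReplicas, 1)) // 8 -> scale-down
}
```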
Environment details:
- Kubernetes version: 1.25
- Cloud-provider/provisioner: AWS
- emqx-operator version: 2.2.23
- Install method: helm
Hi @ddellarocca, thanks for the feedback. I'm on Chinese New Year holidays; I plan to check this after 5th Feb.
Hi @Rory-Z, no worries. We encountered it in production, but we still managed to do what we needed to do, so it's OK to wait.
Hi @ddellarocca, I'm sorry, I couldn't reproduce this issue in my local cluster. In https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/update_emqx_status.go#L82-L102 the number of running replicas is not the sum of the two StatefulSets: it checks whether the UID and the controller UID match, so I think CurrentReplicas is right. Also, checking your log file, the maximum number of emqx-core-7c6fbb448d replicas is 93, which is more than the sum of the two StatefulSets.
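To illustrate the kind of owner-UID check I mean (a simplified sketch, not the operator's actual code; the function name is made up):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// countOwnedReadyPods counts only the pods whose controller owner reference
// points at the given StatefulSet UID, so pods belonging to the other
// StatefulSet are not included in the replica count.
func countOwnedReadyPods(pods []corev1.Pod, stsUID types.UID) int32 {
	var n int32
	for i := range pods {
		owner := metav1.GetControllerOf(&pods[i])
		if owner == nil || owner.UID != stsUID {
			continue // owned by something else (e.g. the other StatefulSet)
		}
		for _, cond := range pods[i].Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				n++
				break
			}
		}
	}
	return n
}
```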
Could you please run kubectl get sts emqx-core-7c6fbb448d -o yaml --show-managed-fields and check managedFields to make sure no other controller has changed the replicas of this StatefulSet?
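If managedFields is present, the entries look roughly like this (the manager name below is only an example); the interesting part is which manager owns f:spec.f:replicas:

```yaml
metadata:
  managedFields:
  - manager: emqx-operator        # example manager name, yours may differ
    operation: Update
    apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:replicas: {}            # whoever owns this field last set .spec.replicas
```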
Hi @Rory-Z, thanks for the check. On my side, I am not able to find the managedFields with kubectl get sts emqx-core-7c6fbb448d -o yaml --show-managed-fields. I can only find the apps.emqx.io/managed-by: emqx-operator attribute.
Regarding the 93 replicas, yes, it is more than the sum, but if you look at the log file it got there gradually. I forgot to mention that those API calls to the API server were filtered by the service account used by the operator, so it was indeed the operator that scaled up the cluster.
Is there anything that I could check?
I have no idea. I checked the code again, and there is no other code that changes the replicas of the StatefulSet except https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/update_emqx_status.go#L82-L102, so I created a PR: https://github.com/emqx/emqx-operator/pull/1108. Would you mind helping me test this? I can put out a beta release, 2.2.28-beta.1.
I could try to test it out next week, but not in the cluster where we faced the issue the first time; would that work?
Thanks a lot, please try https://github.com/emqx/emqx-operator/releases/tag/2.2.29-beta.1
Hi @Rory-Z, sorry for not updating this topic; I've been very busy in the meantime. However, I saw that your change has been merged and published with version 2.2.29. I'm going to try it next week and let you know how it goes.
It seems to have fixed it! Thanks!