emqx-operator icon indicating copy to clipboard operation
emqx-operator copied to clipboard

Operator scale-up loop with recreate strategy

Open ddellarocca opened this issue 11 months ago • 7 comments

Describe the bug During a deployment with the update strategy set to recreate in a 28-node cluster, the operator began increasing the number of replicas on the old StatefulSet instead of deleting it. The update had a dual purpose: scaling up and changing a few settings. We first scaled up (from 9:00:58 to 9:01:43), then changed the configurations around 9:05, which triggered the creation of a new StatefulSet (old being emqx-core-7c6fbb448d and new emqx-core-778c694444). However, once the new StatefulSet was ready and nodes joined the cluster, the operator started increasing the replicas of the old StatefulSet (API server audit log events attached logs-insights-results.csv), kubectl get statefulset emqx-core-7c6fbb448d show 0 ready replicas and the requested were increasing.

As a workaround we scaled to 0 the operator and then scaled to 0 the old StatefulSet.

To Reproduce We did not manage to reproduce it in non production environments.

Expected behavior Instead of increasing the number of instances it should delete the old StatefulSet

Anything else we need to know?: By looking at the code we think that the way the operator calculate the number of replicas during scale down is wrong, here is the part where it calculates the number of replicas https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/sync_pods.go#L95 and that number comes from the status instead of the replicas field https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/update_emqx_status.go#L82-L102 the number of running replicas is the sum of the two StatefulSets. This creates a scale up instead of scale down.

Environment details::

  • Kubernetes version: 1.25
  • Cloud-provider/provisioner: AWS
  • emqx-operator version: 2.2.23
  • Install method: e.g. helm

ddellarocca avatar Jan 30 '25 09:01 ddellarocca

Hi, @ddellarocca Thanks for feedback, I'm in Chinese New Year Holidays, I plan check this after 5th Feb.

Rory-Z avatar Jan 31 '25 18:01 Rory-Z

Hi @Rory-Z, no worries we encoutered it in production but still we managed to do what we needed to do, it's ok to wait

ddellarocca avatar Feb 01 '25 17:02 ddellarocca

Hi @ddellarocca I'm sorry I didn't repeat this issue in my local cluster, in https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/update_emqx_status.go#L82-L102, the number of running replicas is not the sum of the two StatefulSets, it will check UID and controller UID is it match, so I think CurrentReplicas is right. And check your log file, the max number of the emqx-core-7c6fbb448d replicas is 93, it's more than the sum of the two StatefulSets.

Could you please running kubectl get sts emqx-core-7c6fbb448d -o yaml --show-managed-fields and check managedFields to make sure there is no other controller changed the replicas of this StatefulSets ?

Rory-Z avatar Feb 06 '25 07:02 Rory-Z

Hi @Rory-Z, thanks for the check. On my side, I am not able to find the managedFields with kubectl get sts emqx-core-7c6fbb448d -o yaml --show-managed-fields. I can only find the apps.emqx.io/managed-by: emqx-operator attributes.

Regarding the 93 replicas, yes, it is more than the sum, but if you look at the log file, it got there gradually. I did miss to mention that those API calls to the API server were filtered by the service account used by the operator, so it was actually that which scaled up the cluster.

Is there anything that I could check?

ddellarocca avatar Feb 06 '25 09:02 ddellarocca

I have no idea, I checked code again, and there is no other code to change the replicas of the statefulSet except https://github.com/emqx/emqx-operator/blob/59bb002e25c92ace2c345950cb1e3395c57f5c83/controllers/apps/v2beta1/update_emqx_status.go#L82-L102, so I create a PR https://github.com/emqx/emqx-operator/pull/1108, would you mind to help me to test this? I can out a beta version release is 2.2.28-beta.1

Rory-Z avatar Feb 07 '25 02:02 Rory-Z

I could try to test it out next week but not in the cluster we faced the issue the first time, would that work out?

ddellarocca avatar Feb 10 '25 10:02 ddellarocca

I could try to test it out next week but not in the cluster we faced the issue the first time, would that work out?

Thanks a lot, please try https://github.com/emqx/emqx-operator/releases/tag/2.2.29-beta.1

Rory-Z avatar Feb 11 '25 05:02 Rory-Z

Hi @Rory-Z, sorry for not updating this topic been very busy in the meantime, however I saw that your change has been merged and published with version 2.2.29, I'm going to try that next week and let you know how it goes.

ddellarocca avatar Jun 19 '25 12:06 ddellarocca

It seems to have fixed it! Thanks!

ddellarocca avatar Jun 25 '25 14:06 ddellarocca