solr-operator

GKE Autopilot constant re-deployment of pods

Open hrvolapeter opened this issue 2 years ago • 4 comments

I'm running the Solr operator on a GKE Autopilot cluster with the example from here. The example requires rounding resource requests up to values supported by Autopilot. However, the operator keeps re-deploying deployments due to a change in resource requests, even though the 'to' and 'from' values are exactly the same. The bug is easiest to observe with the Prometheus exporter, which is redeployed a couple of times per second. The same happens with the cluster, but less often, seemingly every couple of minutes.

Logs are below, and the relevant code seems to be this. Note the 'to' and 'from' fields.

controller-runtime.manager.controller.solrprometheusexporter Update required because field changed {"reconciler group": "solr.apache.org", "reconciler kind": "SolrPrometheusExporter", "name": "explore-prom-exporter", "namespace": "default", "deployment": "explore-prom-exporter-solr-metrics", "kind": "deployment", "field": "Spec.Template.Spec.Containers[0].Resources", "from": {"limits":{"cpu":"250m","ephemeral-storage":"1Gi","memory":"512Mi"},"requests":{"cpu":"250m","ephemeral-storage":"1Gi","memory":"512Mi"}}, "to": {"limits":{"cpu":"250m","ephemeral-storage":"1Gi","memory":"512Mi"},"requests":{"cpu":"250m","ephemeral-storage":"1Gi","memory":"512Mi"}}}
Error
2022-01-13T21:46:02.884596420Z2022-01-13T21:46:02.884Z INFO controller-runtime.manager.controller.solrcloud Update required because field changed {"reconciler group": "solr.apache.org", "reconciler kind": "SolrCloud", "name": "explore", "namespace": "default", "zookeeperCluster": "explore-solrcloud-zookeeper", "kind": "zookeeperCluster", "field": "Spec.Pod.Resources", "from": {"limits":{"cpu":"250m","ephemeral-storage":"512Mi","memory":"500Mi"},"requests":{"cpu":"250m","ephemeral-storage":"512Mi","memory":"500Mi"}}, "to": {"limits":{"cpu":"250m","ephemeral-storage":"512Mi","memory":"500Mi"},"requests":{"cpu":"250m","ephemeral-storage":"512Mi","memory":"500Mi"}}}

Jan 14 '22 07:01 hrvolapeter

So Autopilot rewrites the resources for the pod, correct? Are there any other fields we should stop syncing/updating if Autopilot is enabled?

Jan 18 '22 16:01 HoustonPutman

As far as I know, only the resources are overwritten by Autopilot.

Autopilot requires memory requests / limits to be divisible by 512Mi, and there are similar requirements for disk and CPU. What I found more surprising is that even if you follow those requirements, the Solr operator keeps redeploying the pods, even when the old resource requests and the ones updated by Autopilot are exactly the same. I've added the diff to the description.
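
To illustrate the rounding, here is a minimal Go sketch that rounds a memory request up to the next multiple of a step. The 512Mi step is taken from the statement above and is not checked against Autopilot's documentation; `roundUpToMultiple` is a hypothetical helper, not part of the operator or of any GKE API.

```go
package main

import "fmt"

// roundUpToMultiple rounds a non-negative value up to the nearest multiple
// of step. Both arguments use the same unit (MiB here); step must be > 0.
// Hypothetical helper for illustration only.
func roundUpToMultiple(value, step int64) int64 {
	if rem := value % step; rem != 0 {
		return value + step - rem
	}
	return value
}

func main() {
	// Assumed increment, taken from the comment above (512Mi for memory).
	const stepMi = 512
	for _, req := range []int64{500, 512, 700} {
		fmt.Printf("%dMi -> %dMi\n", req, roundUpToMultiple(req, stepMi))
	}
}
```

With these inputs, a 500Mi request becomes 512Mi, 512Mi is left unchanged, and 700Mi becomes 1024Mi.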

Jan 19 '22 16:01 hrvolapeter

OK, so given your logs, the first thing to fix is that the operator isn't treating identical resource amounts as equal. This isn't actually related to GKE Autopilot, but fixing it should address that part of your problem. We will also need a separate PR to stop syncing resources for the pods. (Note: I don't think this will work with the Zookeeper pods managed by the Zookeeper Operator.)
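
As a rough illustration of what "treating the same amounts as equal" means, here is a minimal, standard-library-only Go sketch that compares resource maps by parsed value rather than by representation. It is not the operator's code, and the thread does not confirm that representation differences are the actual cause. `parseMilli` and `sameQuantities` are hypothetical helpers that handle only integer quantities with the `m`, `Ki`, `Mi`, and `Gi` suffixes. In real Kubernetes code this kind of comparison is normally done with the `Quantity.Cmp` method or `equality.Semantic.DeepEqual` from `k8s.io/apimachinery`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// suffixes lists the small subset of quantity suffixes this sketch supports,
// with each multiplier expressed in milli-units so that "250m" stays integral.
var suffixes = []struct {
	name  string
	milli int64
}{
	{"Ki", 1000 << 10},
	{"Mi", 1000 << 20},
	{"Gi", 1000 << 30},
	{"m", 1},
}

// parseMilli converts a quantity string such as "250m" or "1Gi" to milli-units.
// Hypothetical, simplified stand-in for a real quantity parser.
func parseMilli(q string) (int64, error) {
	for _, s := range suffixes {
		if strings.HasSuffix(q, s.name) {
			n, err := strconv.ParseInt(strings.TrimSuffix(q, s.name), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * s.milli, nil
		}
	}
	// No suffix: a plain integer number of whole units.
	n, err := strconv.ParseInt(q, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * 1000, nil
}

// sameQuantities reports whether two resource maps hold the same amounts,
// regardless of how each amount is written. Unparseable values count as unequal.
func sameQuantities(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for name, av := range a {
		bv, ok := b[name]
		if !ok {
			return false
		}
		x, errA := parseMilli(av)
		y, errB := parseMilli(bv)
		if errA != nil || errB != nil || x != y {
			return false
		}
	}
	return true
}

func main() {
	from := map[string]string{"cpu": "250m", "memory": "1Gi"}
	to := map[string]string{"cpu": "250m", "memory": "1024Mi"}
	// The strings differ, but the amounts are the same.
	fmt.Println(from["memory"] == to["memory"], sameQuantities(from, to))
}
```

A comparison based on the parsed value only requests an update when an amount really changes, which avoids a reconcile loop driven by formatting differences.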

To be clear, you are using the VerticalPodAutoscaler that is part of GKE Autopilot, correct? Is that what is changing the resource amounts for the pods?

Jan 19 '22 18:01 HoustonPutman

Hi @HoustonPutman, thanks for taking a look.

I'm not using the VerticalPodAutoscaler, at least not knowingly. The Autopilot behavior of changing resources is described in this documentation: whenever the resources are outside the limits or allowed combinations, Autopilot changes them to the closest allowed value. It's definitely possible to figure out the right values on your own, but even then I hit the issue where identical values are not treated as equal.

Jan 24 '22 19:01 hrvolapeter