[BUG] calico-kube-controllers deployment is labeled twice with the CriticalAddonsOnly toleration

Open rgarcia89 opened this issue 1 year ago • 14 comments

Describe the bug On AKS clusters with Calico enabled, a namespace calico-system is created. Within it is a deployment calico-kube-controllers. This deployment currently carries the CriticalAddonsOnly toleration twice. This causes an error in Prometheus starting with v2.52.0, since that version introduced a check for duplicate samples.

       tolerations:
       - key: CriticalAddonsOnly # <- no 1
         operator: Exists
       - effect: NoSchedule
         key: node-role.kubernetes.io/master
       - effect: NoSchedule
         key: node-role.kubernetes.io/control-plane
       - key: CriticalAddonsOnly # <- no 2
         operator: Exists

Because of the second occurrence of the CriticalAddonsOnly toleration, the kube-state-metrics pod creates the same metric twice. I had originally filed an issue on the Prometheus project, expecting this to be a Prometheus bug, which it isn't: https://github.com/prometheus/prometheus/issues/14089
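To see why this trips the duplicate-sample check, here is a minimal Python sketch (not actual kube-state-metrics code; the label handling is simplified and assumed) of how the tolerations above map to metric label sets:

```python
# Sketch: kube-state-metrics emits one kube_pod_tolerations sample per
# toleration, keyed by its label set, so a duplicated toleration yields
# two samples with identical labels (and Prometheus >= v2.52.0 drops one).
tolerations = [
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
    {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"},
    {"key": "CriticalAddonsOnly", "operator": "Exists"},  # the duplicate
]

# Label set for each sample, as it would appear on the metric.
label_sets = [tuple(sorted(t.items())) for t in tolerations]
num_duplicates = len(label_sets) - len(set(label_sets))
print(num_duplicates)  # 1 -> one pair of identical series
```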

Prometheus log output

ts=2024-05-13T19:20:40.233Z caller=main.go:1372 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=95.860644ms db_storage=1.142µs remote_storage=150.634µs web_handler=872ns query_engine=776ns scrape=98.941µs scrape_sd=7.197985ms notify=13.095µs notify_sd=269.119µs rules=54.251368ms tracing=6.745µs
...
ts=2024-05-13T19:21:09.190Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"calico-system\",pod=\"calico-kube-controllers-75c647b46c-pg9cr\",uid=\"bf944c52-17bd-438b-bbf1-d97f8671bd6b\",key=\"CriticalAddonsOnly\",operator=\"Exists\"}"
ts=2024-05-13T19:21:09.207Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1

Environment (please complete the following information):

  • Kubernetes version 1.27.9

rgarcia89 avatar May 13 '24 19:05 rgarcia89

Same for v1.28.9

felixZdi avatar May 16 '24 09:05 felixZdi

@sabbour this validation could break the above-mentioned deployment: https://github.com/kubernetes/kubernetes/issues/124881

rgarcia89 avatar May 21 '24 09:05 rgarcia89

Any update on this?

idogada-akamai avatar Jun 10 '24 13:06 idogada-akamai

Would love to see this resolved, this is creating log spam and alerts on our prometheus stack due to duplicate labels.

Aaron-ML avatar Jun 11 '24 19:06 Aaron-ML

@Aaron-ML I am also using the kube-prometheus-stack and have downgraded prometheus to v2.51.2 until it is fixed...

rgarcia89 avatar Jun 12 '24 09:06 rgarcia89

> @Aaron-ML I am also using the kube-prometheus-stack and have downgraded prometheus to v2.51.2 until it is fixed...

We've mitigated it for now by temporarily removing the alert related to prometheus ingest failures. Hopefully this gets resolved soon.

Aaron-ML avatar Jun 12 '24 17:06 Aaron-ML

@chasewilson any update available?

rgarcia89 avatar Jun 26 '24 14:06 rgarcia89

@chasewilson can you please provide an update?

bregtaca avatar Jul 16 '24 11:07 bregtaca

Any updates on this issue?

dsiperek-vendavo avatar Jul 18 '24 17:07 dsiperek-vendavo

@wedaly I know we'd investigated this. Could you add some clarity here?

chasewilson avatar Jul 18 '24 18:07 chasewilson

AKS creates the operator.tigera.io/v1 Installation resource that tells tigera-operator how to install Calico. In the installation CR, we're setting:

  controlPlaneTolerations:
  - key: CriticalAddonsOnly
    operator: Exists

tigera-operator code appends this to the list of default tolerations for calico-kube-controllers, which already includes this toleration: https://github.com/tigera/operator/blob/b01279889cd2a625fde862afb7b41e27b9dcce19/pkg/render/kubecontrollers/kube-controllers.go#L648

I don't know the full context of why AKS sets this field in the installation CR, but it's been this way for a long time (I think as long ago as 2021).

I'm not yet sure why we added that or if it's safe to remove, as I can see controlPlaneTolerations referenced elsewhere in tigera-operator. This needs a bit more investigation to verify that it's safe, but if so I think AKS could remove controlPlaneTolerations to address this bug.
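As a simplified Python mirror of the merge described above (assumed behavior and ordering, condensed from the linked Go code): the operator's default toleration list already contains CriticalAddonsOnly, and the Installation CR value is appended without de-duplication:

```python
# Simplified sketch of the tigera-operator rendering logic (assumed):
# append(c.cfg.Installation.ControlPlaneTolerations, defaults...) with no dedup.
control_plane_tolerations = [  # set by AKS in the Installation CR
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
]
default_tolerations = [  # rmeta.TolerateCriticalAddonsAndControlPlane
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
    {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"},
]

rendered = control_plane_tolerations + default_tolerations  # plain append
keys = [t["key"] for t in rendered]
print(keys.count("CriticalAddonsOnly"))  # 2 -> the duplicate in the deployment
```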

wedaly avatar Jul 18 '24 19:07 wedaly

@wedaly in that case it is being added by both the AKS Installation resource and the tigera-operator.

The line you linked shows that, in addition to the tolerations passed in via the config, some defaults from the meta package are appended:

Tolerations:        append(c.cfg.Installation.ControlPlaneTolerations, rmeta.TolerateCriticalAddonsAndControlPlane...),

https://github.com/tigera/operator/blob/b01279889cd2a625fde862afb7b41e27b9dcce19/pkg/render/kubecontrollers/kube-controllers.go#L648

If you follow that path, you can see that the toleration is already defined there:

TolerateCriticalAddonsOnly = corev1.Toleration{
	Key:      "CriticalAddonsOnly",
	Operator: corev1.TolerationOpExists,
}

https://github.com/tigera/operator/blob/b01279889cd2a625fde862afb7b41e27b9dcce19/pkg/render/common/meta/meta.go#L56-L59

Therefore you should be good to remove it from the AKS installation resource.

rgarcia89 avatar Jul 18 '24 20:07 rgarcia89

Digging through the commit history in AKS, I see that the toleration was added as a repair item for a production issue during the migration to tigera-operator. The repair item is linked to this issue in GH: https://github.com/projectcalico/calico/issues/4525

However, I'm not sure how adding the toleration is related to the symptoms described in that issue. And all AKS clusters on supported k8s versions should be using tigera-operator now.

Seems like it should be safe to remove the toleration from the installation CR now.

wedaly avatar Jul 18 '24 21:07 wedaly

@wedaly any update here?

rgarcia89 avatar Jul 29 '24 06:07 rgarcia89

@chasewilson @wedaly can we please get an update? This is currently holding us back from being able to update Prometheus.

rgarcia89 avatar Aug 06 '24 11:08 rgarcia89

Apologies for the delayed response. The current plan is to remove controlPlaneTolerations from the installation CR to address this bug.

However, this change has the side-effect of adding two additional tolerations to Calico's typha deployment to tolerate every taint (https://github.com/tigera/operator/blob/8cbb161896a4ca641f885e668528cdb52de83f84/pkg/render/typha.go#L400). We believe this is safe, but any change like this carries some risk as it could affect many clusters.

For this reason, we are planning to remove controlPlaneTolerations only starting with the next Calico version released in AKS. This will be Calico 3.28 released in AKS k8s version 1.31, which will be previewed in September and generally available in October (schedule here).

I realize this doesn't provide an immediate solution to folks on earlier k8s versions that want to upgrade Prometheus, but we need to balance the severity of this bug against the risks of making a config change that would affect many AKS clusters.

wedaly avatar Aug 22 '24 21:08 wedaly

Hi, I can see that Kubernetes version 1.31.1 is available in preview on AKS. Can you confirm this fixes the Calico duplicate toleration?

Nastaliss avatar Nov 12 '24 14:11 Nastaliss

Still waiting for a clarification if the duplicate tolerations will be fixed in an upcoming release of AKS. Can someone please comment?

tc-platform avatar Feb 24 '25 09:02 tc-platform

Will this land on a patch release for Azure Local AKS clusters? 1.31 seems quite some time out given the latest version is 1.29.x

xvzf avatar Mar 19 '25 14:03 xvzf

This is not stale

EraYaN avatar Apr 22 '25 12:04 EraYaN

This seems to have been rolled out, our 1.31 cluster has only one toleration on the deployment.

EraYaN avatar May 23 '25 07:05 EraYaN

Thanks for reaching out. I'm closing this issue as it was marked with "resolution/fix-released" and it hasn't had activity for 7 days.