datadog-agent
datadog-agent copied to clipboard
datadog-cluster-agent-metrics-api failing with FailedDiscoveryCheck seemingly randomly
Output of the info page (if this is a bug)
(Paste the output of the info page here)
Describe what happened: We're relying on datadog external metrics to autoscale some of our applications and recently we've been noticing some weird cluster-agent behavior. 99,9% of the time it works as intented, but sometimes at seemingly random the datadog-cluster-agent-metrics-api goes unavailable with FailedDiscoveryCheck.
❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME SERVICE AVAILABLE AGE
v1beta1.external.metrics.k8s.io datadog/datadog-cluster-agent-metrics-api False (FailedDiscoveryCheck) 148d
There's nothing in kube events when this happens. I'm not seeing anything too suspicious in cluster-agent logs. Container reports being totally healthy so no automatic restart. Logs & Metrics & everything continue working normally. But everything relying on the apiservice like the HPAs stop working correctly.
After restarting the cluster-agent it starts responding normally again:
❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME SERVICE AVAILABLE AGE
v1beta1.external.metrics.k8s.io datadog/datadog-cluster-agent-metrics-api True 148d
Describe what you expected: Based on this & the fact that restart solves the issue I'd maybe expect cluster-agent health check to fail if the apiservice fails. Not sure if that's a good idea or easily doable.
Steps to reproduce the issue: Again, I'm not sure how to reproduce or debug this really as it happens seemingly at random. We've seen it working 100% fine for weeks or even months before randomly failing at some point.
Additional environment details (Operating System, Cloud provider, etc): On AWS EKS kube Server Version: v1.19.8-eks-96780e datadog cluster agent version: 1.12.0 (installed with https://github.com/DataDog/helm-charts/tree/master/charts/datadog)