datadog-cluster-agent-metrics-api failing with FailedDiscoveryCheck seemingly randomly
Describe what happened: We're relying on Datadog external metrics to autoscale some of our applications, and recently we've been noticing some weird cluster-agent behavior. 99.9% of the time it works as intended, but sometimes, seemingly at random, the datadog-cluster-agent-metrics-api goes unavailable with FailedDiscoveryCheck.
❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                      AVAILABLE                      AGE
v1beta1.external.metrics.k8s.io   datadog/datadog-cluster-agent-metrics-api    False (FailedDiscoveryCheck)   148d
There's nothing in kube events when this happens, and I'm not seeing anything too suspicious in the cluster-agent logs. The container reports being totally healthy, so there's no automatic restart. Logs, metrics, and everything else continue working normally, but everything relying on the apiservice, like the HPAs, stops working correctly.
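For anyone else debugging this, a generic way to probe the same discovery endpoint the apiserver is checking (nothing Datadog-specific, it just goes through the aggregation layer) is:
❯ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
When the apiservice is healthy this returns the external metrics resource list; when it's in the failed state the call hangs or errors out.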
After restarting the cluster-agent it starts responding normally again:
❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                      AVAILABLE   AGE
v1beta1.external.metrics.k8s.io   datadog/datadog-cluster-agent-metrics-api    True        148d
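(For what it's worth, the "restart" here is just a normal rollout restart of the cluster-agent deployment; the namespace and deployment name below are the Helm chart defaults in our install, so adjust as needed.)
❯ kubectl -n datadog rollout restart deployment/datadog-cluster-agent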
Describe what you expected: Based on this, and the fact that a restart solves the issue, I'd maybe expect the cluster-agent health check to fail when the apiservice fails. Not sure if that's a good idea or easily doable.
Steps to reproduce the issue: Again, I'm not really sure how to reproduce or debug this, as it happens seemingly at random. We've seen it work 100% fine for weeks or even months before randomly failing at some point.
Additional environment details (Operating System, Cloud provider, etc): AWS EKS, Kubernetes server version v1.19.8-eks-96780e, Datadog Cluster Agent version 1.12.0 (installed with https://github.com/DataDog/helm-charts/tree/master/charts/datadog)
My $0.02: Our cluster runs on GKE as a private cluster. After inspecting the corresponding apiservice, I could see a timeout message on port 8443:
➜ ~ kubectl describe apiservice v1beta1.external.metrics.k8s.io
[...]
Status:
  Conditions:
    Last Transition Time:  2021-04-14T07:29:52Z
    Message:               failing or missing response from https://172.24.2.187:8443/apis/external.metrics.k8s.io/v1beta1: Get "https://172.24.2.187:8443/apis/external.metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Opening up port 8443 for traffic originating from the master fixed this immediately in our case.
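In case someone needs it, a firewall rule along these lines should do it; the rule name, network, node tag, and master CIDR are placeholders you'd replace with your own cluster's values:
➜ ~ gcloud compute firewall-rules create allow-master-to-cluster-agent \
      --network=<cluster-vpc-network> \
      --direction=INGRESS \
      --allow=tcp:8443 \
      --source-ranges=<master-cidr> \
      --target-tags=<gke-node-tag>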
@memphis88 I'm running into the same problem, except I always get the timeout error, even after a clean install of the Datadog agent and its cluster agent following the documentation.
I'm also on GKE 1.24.7-gke.900. Can you please describe how you opened port 8443?
Actually, I had to update the Service port from 8443 to 443, otherwise I was getting the following error:
service/datadog-custom-metrics-server in "datadog" is not listening on port 443
Even though the official Datadog documentation explicitly tells us to use 8443:
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: datadog-custom-metrics-server
  name: datadog-custom-metrics-server
spec:
  ports:
  - protocol: TCP
    port: 8443
    targetPort: 8443
    name: metricsapi
  selector:
    app.kubernetes.io/name: datadog-cluster-agent
https://docs.datadoghq.com/containers/guide/cluster_agent_autoscaling_metrics/?tab=daemonset#register-the-external-metrics-provider-service
I managed to allow port 8443 in our gke-xxx-yyy-zzz-master firewall rule, and it solved the problem immediately.
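Roughly like this; note that --allow on update replaces the whole allow list, so keep the ports that are already on the rule (typically tcp:443 and tcp:10250 on the auto-created master rule):
gcloud compute firewall-rules list --filter="name~gke-.*-master"
gcloud compute firewall-rules update gke-xxx-yyy-zzz-master --allow tcp:10250,tcp:443,tcp:8443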
I still think the documentation needs to be fixed to use port 443 instead of 8443, since that seems to be the port v1beta1.external.metrics.k8s.io expects to use.
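For what it's worth, I don't think the 443 vs 8443 choice is hard-wired: the APIService object has a spec.service.port field (defaulting to 443), so the registration can be pointed at 8443 explicitly instead of changing the Service. A rough, untested sketch, with the service name/namespace taken from the manifest above:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: datadog-custom-metrics-server
    namespace: datadog
    port: 8443   # defaults to 443 when omitted, which is where the mismatch comes from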
Sorry, I don't remember the details around this issue, but reading back my comment, the gist is that the nodes were blocking ingress traffic on port 8443, because that's the default firewall behavior on GKE/GCP. Allowing traffic on this port from the master IPs (via the firewall rule you pointed out) solves the issue.
This can also be caused by a wrong or missing Datadog app key.
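If you're installing via the Helm chart, the relevant values look roughly like this (value names from memory, so double-check against the chart's values.yaml):
datadog:
  apiKey: <api-key>
  appKey: <app-key>                 # the external metrics provider needs a valid app key
  # appKeyExistingSecret: <secret>  # or reference an existing Secret instead
clusterAgent:
  metricsProvider:
    enabled: true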
FWIW, I ran into the same problem and this fixed it. Thank you.
This issue has been automatically marked as stale because it has not had activity in the past 15 days.
It will be closed in 30 days if no further activity occurs. If this issue is still relevant, adding a comment will keep it open. Also, you can always reopen the issue if you missed the window.
Thank you for your contributions!