
datadog-cluster-agent-metrics-api failing with FailedDiscoveryCheck seemingly randomly

Open bcha opened this issue 4 years ago • 7 comments

Output of the info page (if this is a bug)

(Paste the output of the info page here)

Describe what happened: We're relying on Datadog external metrics to autoscale some of our applications, and recently we've been noticing some weird cluster-agent behavior. 99.9% of the time it works as intended, but sometimes, seemingly at random, the datadog-cluster-agent-metrics-api apiservice goes unavailable with FailedDiscoveryCheck.

❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                     AVAILABLE                      AGE
v1beta1.external.metrics.k8s.io   datadog/datadog-cluster-agent-metrics-api   False (FailedDiscoveryCheck)   148d

There's nothing in kube events when this happens, and I'm not seeing anything too suspicious in the cluster-agent logs. The container reports being totally healthy, so there's no automatic restart. Logs, metrics, and everything else continue working normally, but everything relying on the apiservice, like the HPAs, stops working correctly.
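
A quick sanity check when this happens (not something we did originally; it assumes kubectl access to the cluster and, optionally, jq) is to query the aggregated API directly and see whether it answers at all:

❯ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

When the apiservice is healthy this returns the APIResourceList; when it's stuck in FailedDiscoveryCheck the request times out or errors.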

After restarting the cluster-agent it starts responding normally again:

❯ k get apiservice v1beta1.external.metrics.k8s.io             
NAME                              SERVICE                                     AVAILABLE   AGE
v1beta1.external.metrics.k8s.io   datadog/datadog-cluster-agent-metrics-api   True        148d

Describe what you expected: Based on this, and the fact that a restart solves the issue, I'd maybe expect the cluster-agent health check to fail when the apiservice fails. Not sure if that's a good idea or easily doable.
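
As a stopgap (again, just a sketch, not something we have in place), the Available condition can also be watched externally and alerted on; the jsonpath below only relies on the standard APIService status fields:

❯ kubectl get apiservice v1beta1.external.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'

Anything other than True means the HPAs backed by external metrics are effectively broken.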

Steps to reproduce the issue: Again, I'm not really sure how to reproduce or debug this, as it happens seemingly at random. We've seen it work 100% fine for weeks or even months before randomly failing at some point.

Additional environment details (Operating System, Cloud provider, etc): AWS EKS, Kubernetes server version v1.19.8-eks-96780e, Datadog Cluster Agent version 1.12.0 (installed with https://github.com/DataDog/helm-charts/tree/master/charts/datadog).

bcha avatar Jun 30 '21 09:06 bcha

My $0.02: Our cluster runs on GKE as a private cluster. After inspecting the corresponding apiservice, I could see a timeout message on port 8443:

➜ ~ kubectl describe apiservice v1beta1.external.metrics.k8s.io

[...]
Status:
  Conditions:
    Last Transition Time:  2021-04-14T07:29:52Z
    Message:               failing or missing response from https://172.24.2.187:8443/apis/external.metrics.k8s.io/v1beta1: Get "https://172.24.2.187:8443/apis/external.metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available

Opening up port 8443 for traffic originating from the master fixed this immediately in our case.
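
For reference, on a GKE private cluster this boils down to a firewall rule letting the control plane reach the nodes on 8443. The rule name, network, master CIDR, and node tag below are placeholders for illustration, not the actual values from our cluster:

➜ ~ gcloud compute firewall-rules create allow-master-to-cluster-agent-8443 \
      --network my-gke-network \
      --direction INGRESS \
      --action ALLOW \
      --rules tcp:8443 \
      --source-ranges 172.16.0.0/28 \
      --target-tags my-gke-node-tag

The source range is the private cluster's control-plane CIDR, and the target tag is whatever tag GKE assigned to the node pool.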

memphis88 avatar Sep 23 '21 07:09 memphis88

@memphis88 I'm running into the same problem, except I always get the timeout error even after a clean install of datadog agent and its cluster agent following the documentation.

I'm also on GKE 1.24.7-gke.900. Can you describe how you opened port 8443, please?

gottfrois avatar Jan 11 '23 14:01 gottfrois

Actually I had to update the Service port from 8443 to 443, otherwise I was getting the following error:

service/datadog-custom-metrics-server in "datadog" is not listening on port 443

Even though the official Datadog documentation explicitly tells us to use 8443:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: datadog-custom-metrics-server
  name: datadog-custom-metrics-server
spec:
  ports:
    - protocol: TCP
      port: 8443
      targetPort: 8443
      name: metricsapi
  selector:
    app.kubernetes.io/name: datadog-cluster-agent

https://docs.datadoghq.com/containers/guide/cluster_agent_autoscaling_metrics/?tab=daemonset#register-the-external-metrics-provider-service
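
For what it's worth, what seems to reconcile the two is keeping the cluster agent's targetPort at 8443 while exposing the Service on 443, since an APIService's service port defaults to 443 when not set explicitly. A minimal sketch of that change, assuming the Service name and namespace from this thread:

❯ kubectl -n datadog patch service datadog-custom-metrics-server --type merge \
    -p '{"spec":{"ports":[{"name":"metricsapi","protocol":"TCP","port":443,"targetPort":8443}]}}'

The cluster agent keeps listening on 8443 inside the pod; only the port the Service exposes changes.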

gottfrois avatar Jan 11 '23 14:01 gottfrois

I managed to allow port 8443 in our gke-xxx-yyy-zzz-master firewall rule, and it solved the problem immediately. I still think the documentation needs to be fixed to use port 443 instead of 8443, since that seems to be the port v1beta1.external.metrics.k8s.io expects to use.

gottfrois avatar Jan 11 '23 15:01 gottfrois

Sorry, I don't remember the details around this issue, but reading back my comment, the gist is that the nodes were blocking ingress traffic on port 8443 because that's the default firewall behavior on GKE/GCP. Allowing traffic on this port from the master IPs (the firewall rule you pointed out) solves the issue.

memphis88 avatar Jan 11 '23 15:01 memphis88

This can also be caused by a wrong or missing Datadog app key.
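
One way to rule that out (the deployment name and namespace here are assumptions based on the standard Helm chart, not taken from this issue) is to check that an app key is actually wired into the cluster agent:

❯ kubectl -n datadog get deploy datadog-cluster-agent -o yaml | grep -i -A 3 APP_KEY

If nothing shows up, or the referenced secret doesn't exist, the external metrics provider can't query Datadog, which lines up with the FailedDiscoveryCheck behavior described above.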

ajax-bychenok-y avatar Oct 11 '23 11:10 ajax-bychenok-y

This can also be caused by a wrong or missing Datadog app key.

FWIW, I ran into the same problem and this fixed it. Thank you.

fw42 avatar Sep 16 '24 09:09 fw42

This issue has been automatically marked as stale because it has not had activity in the past 15 days.

It will be closed in 30 days if no further activity occurs. If this issue is still relevant, adding a comment will keep it open. Also, you can always reopen the issue if you missed the window.

Thank you for your contributions!

dd-octo-sts[bot] avatar Oct 21 '25 04:10 dd-octo-sts[bot]