datadog-cluster-agent-metrics-api failing with FailedDiscoveryCheck seemingly randomly
Describe what happened: We're relying on Datadog external metrics to autoscale some of our applications, and recently we've been noticing some weird cluster-agent behavior. 99.9% of the time it works as intended, but sometimes, seemingly at random, the datadog-cluster-agent-metrics-api goes unavailable with FailedDiscoveryCheck.
❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                      AVAILABLE                      AGE
v1beta1.external.metrics.k8s.io   datadog/datadog-cluster-agent-metrics-api    False (FailedDiscoveryCheck)   148d
There's nothing in kube events when this happens, and I'm not seeing anything too suspicious in the cluster-agent logs. The container reports being totally healthy, so there's no automatic restart. Logs, metrics, and everything else continue working normally, but everything relying on the apiservice, like the HPAs, stops working correctly.
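For anyone else debugging this, a generic way to probe the same discovery endpoint the apiserver is checking (nothing Datadog-specific, it just goes through the aggregation layer) is:
❯ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
When the apiservice is healthy this returns the external metrics resource list; when it's in the failed state the call hangs or errors out.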
After restarting the cluster-agent it starts responding normally again:
❯ k get apiservice v1beta1.external.metrics.k8s.io
NAME                              SERVICE                                      AVAILABLE   AGE
v1beta1.external.metrics.k8s.io   datadog/datadog-cluster-agent-metrics-api    True        148d
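(For what it's worth, the "restart" here is just a normal rollout restart of the cluster-agent deployment; the namespace and deployment name below are the Helm chart defaults in our install, so adjust as needed.)
❯ kubectl -n datadog rollout restart deployment/datadog-cluster-agent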
Describe what you expected: Based on this, and the fact that a restart solves the issue, I'd maybe expect the cluster-agent health check to fail when the apiservice fails. Not sure if that's a good idea or easily doable.
Steps to reproduce the issue: Again, I'm not really sure how to reproduce or debug this, as it happens seemingly at random. We've seen it work 100% fine for weeks or even months before randomly failing at some point.
Additional environment details (Operating System, Cloud provider, etc): AWS EKS, Kubernetes server version v1.19.8-eks-96780e, Datadog Cluster Agent version 1.12.0 (installed with https://github.com/DataDog/helm-charts/tree/master/charts/datadog)
My $0.02: Our cluster runs on GKE as a private cluster. After inspecting the corresponding apiservice, I could see a timeout message on port 8443:
➜ ~ kubectl describe apiservice v1beta1.external.metrics.k8s.io
[...]
Status:
  Conditions:
    Last Transition Time:  2021-04-14T07:29:52Z
    Message:               failing or missing response from https://172.24.2.187:8443/apis/external.metrics.k8s.io/v1beta1: Get "https://172.24.2.187:8443/apis/external.metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Opening up port 8443 for traffic originating from the master fixed this immediately in our case.
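In case someone needs it, a firewall rule along these lines should do it; the rule name, network, node tag, and master CIDR are placeholders you'd replace with your own cluster's values:
➜ ~ gcloud compute firewall-rules create allow-master-to-cluster-agent \
      --network=<cluster-vpc-network> \
      --direction=INGRESS \
      --allow=tcp:8443 \
      --source-ranges=<master-cidr> \
      --target-tags=<gke-node-tag>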
@memphis88 I'm running into the same problem, except I always get the timeout error, even after a clean install of the Datadog agent and its cluster agent following the documentation.
I'm also on GKE 1.24.7-gke.900. Can you please describe how you opened port 8443?
Actually, I had to update the Service port from 8443 to 443, otherwise I was getting the following error:
service/datadog-custom-metrics-server in "datadog" is not listening on port 443
Even though the official Datadog documentation explicitly tells us to use 8443:
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: datadog-custom-metrics-server
  name: datadog-custom-metrics-server
spec:
  ports:
  - protocol: TCP
    port: 8443
    targetPort: 8443
    name: metricsapi
  selector:
    app.kubernetes.io/name: datadog-cluster-agent
https://docs.datadoghq.com/containers/guide/cluster_agent_autoscaling_metrics/?tab=daemonset#register-the-external-metrics-provider-service
I managed to allow port 8443 in our gke-xxx-yyy-zzz-master firewall rule, and it solved the problem immediately.
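Roughly like this; note that --allow on update replaces the whole allow list, so keep the ports that are already on the rule (typically tcp:443 and tcp:10250 on the auto-created master rule):
gcloud compute firewall-rules list --filter="name~gke-.*-master"
gcloud compute firewall-rules update gke-xxx-yyy-zzz-master --allow tcp:10250,tcp:443,tcp:8443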
I still think the documentation needs to be fixed to use port 443 instead of 8443, since that seems to be the port v1beta1.external.metrics.k8s.io expects to use.
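For what it's worth, I don't think the 443 vs 8443 choice is hard-wired: the APIService object has a spec.service.port field (defaulting to 443), so the registration can be pointed at 8443 explicitly instead of changing the Service. A rough, untested sketch, with the service name/namespace taken from the manifest above:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: datadog-custom-metrics-server
    namespace: datadog
    port: 8443   # defaults to 443 when omitted, which is where the mismatch comes from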
Sorry, I don't remember the details around this issue, but reading back my comment, the gist is that the nodes were blocking ingress traffic on port 8443, because that's the default firewall behavior on GKE/GCP. Allowing traffic on this port from the master IPs (via the firewall rule you pointed out) solves the issue.
This can also be caused by a wrong or missing Datadog app key.
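If you're installing via the Helm chart, the relevant values look roughly like this (value names from memory, so double-check against the chart's values.yaml):
datadog:
  apiKey: <api-key>
  appKey: <app-key>                 # the external metrics provider needs a valid app key
  # appKeyExistingSecret: <secret>  # or reference an existing Secret instead
clusterAgent:
  metricsProvider:
    enabled: true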
FWIW, I ran into the same problem and this fixed it. Thank you.
This issue has been automatically marked as stale because it has not had activity in the past 15 days.
It will be closed in 30 days if no further activity occurs. If this issue is still relevant, adding a comment will keep it open. Also, you can always reopen the issue if you missed the window.
Thank you for your contributions!