datadog-agent
[BUG] forwarderHealth fails to recover from failures
forwarderHealth's healthCheckLoop returns when hasValidAPIKey returns false here. Once that happens, the forwarder's health no longer gets updated.
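For illustration, here is a minimal, self-contained Go sketch of the failure mode described above. The names healthCheckLoop, keyValid, and forwarderHealthy are stand-ins invented for this example, not the agent's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// keyValid simulates the backend's verdict on the API key; during the
// incident described below, the real validation call returned a 403.
var keyValid atomic.Bool

// forwarderHealthy stands in for the status the readiness probe reports.
var forwarderHealthy atomic.Bool

// healthCheckLoop mirrors the reported bug: the first failed key check
// marks the forwarder unhealthy and returns, so no later tick can ever
// flip the status back to healthy.
func healthCheckLoop() {
	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		if !keyValid.Load() {
			forwarderHealthy.Store(false)
			return // loop exits permanently on a single failure
		}
		forwarderHealthy.Store(true)
	}
}

func main() {
	keyValid.Store(true)
	go healthCheckLoop()
	time.Sleep(50 * time.Millisecond)

	keyValid.Store(false) // transient incident: key validation fails
	time.Sleep(50 * time.Millisecond)
	keyValid.Store(true) // incident resolved, the key is valid again

	time.Sleep(100 * time.Millisecond)
	fmt.Println("healthy after recovery:", forwarderHealthy.Load()) // false: health never recovers
}
```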
I observed a cluster check runner instance with an "unhealthy" forwarder that was nonetheless able to forward data to Datadog:
$ grep -B3 'Healthcheck failed' logs.txt | head -4
2024-03-06 17:19:33 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:392 in internalProcess) | Successfully posted payload to "https://7-50-3-app.agent.datadoghq.com/api/v2/series"
2024-03-06 17:49:33 UTC | CORE | WARN | (comp/forwarder/defaultforwarder/forwarder_health.go:231 in hasValidAPIKey) | api_key '***************************2f6ef' for domain https://api.datadoghq.com/ is invalid
2024-03-06 17:49:33 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/forwarder_health.go:137 in healthCheckLoop) | No valid api key found, reporting the forwarder as unhealthy.
2024-03-06 17:50:10 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:75 in healthHandler) | Healthcheck failed on: [forwarder]
$ grep -A100000 nhealthy logs.txt | grep forwarder | grep -v failed | head -3
2024-03-06 17:49:33 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/forwarder_health.go:137 in healthCheckLoop) | No valid api key found, reporting the forwarder as unhealthy.
2024-03-06 18:22:03 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:392 in internalProcess) | Successfully posted payload to "https://7-50-3-app.agent.datadoghq.com/api/v1/check_run"
2024-03-06 19:24:33 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:392 in internalProcess) | Successfully posted payload to "https://7-50-3-app.agent.datadoghq.com/api/v2/series"
I shared a flare from that instance in this support case, too: https://help.datadoghq.com/hc/en-us/requests/1583229
Same here. Many clusters in production are experiencing the same issue with datadog-agent.
It's probably worth pointing out that, if I'm reading the code right, this failure can only occur when the Datadog backend responds with a 403. That hints at a problem with API key validation in the backend.
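To make that concrete, a purely illustrative Go sketch of the kind of check being described; the function name and the exact status-code handling are assumptions, not the agent's actual implementation:

```go
package example

import "net/http"

// apiKeyIsValid interprets the response from a key-validation request.
// Only an explicit 403 from the backend marks the key invalid; transport
// errors or 5xx responses leave the validity unknown.
func apiKeyIsValid(resp *http.Response, err error) (valid bool, known bool) {
	if err != nil {
		return false, false // network/transport error: can't tell
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		return true, true
	case http.StatusForbidden: // 403: backend rejected the key
		return false, true
	default:
		return false, false // e.g. 5xx during an incident: validity unknown
	}
}
```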
I can confirm it happens for us as well. A simple restart does the job; too bad the issue doesn't cause the liveness probe to fail.
This happened for us as well today. As others have mentioned, restarting the pods with the unready agent containers appears to have sorted things as of about an hour ago.
I can confirm this issue too.
Same. We got alerts on readiness failing, and were scrambling to understand this yesterday.
There was an incident on Datadog yesterday. We also had some instances impacted, but we only saw them today. https://status.datadoghq.com/incidents/q2d98y2qv54j
Hi everyone, thank you for reporting this.
The problem (as @voiski correctly identified) is tied to an already-known, short-lived incident yesterday that could have impacted the API key validation of some Agents. If needed, and as others have suggested, a restart of the Agent should correct the health reporting. We will be revisiting the Agent's health logic in the future to see what we can do to prevent this from recurring.
I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.
Thanks, Srdjan
@sgnn7 thanks for the update. I'm glad to hear the backend issue got resolved.
I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.
Please consider fixing the healthCheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.
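For illustration only, a hedged sketch of the kind of change being requested: keep the loop running and re-validate on every tick so a transient failure can heal itself. The names and signature here are invented, and this is not the actual patch that later landed:

```go
package example

import "time"

// healthCheckLoopWithRetry never exits on a failed key check; checkKey and
// setHealthy are stand-ins for the agent's real validation and health calls.
func healthCheckLoopWithRetry(stop <-chan struct{}, checkKey func() bool, setHealthy func(bool)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Record the latest result but stay in the loop, so the next
			// tick can flip the forwarder back to healthy once the backend
			// accepts the key again.
			setHealthy(checkKey())
		}
	}
}
```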
When a pod fails the readiness check, Kubernetes is unable to route traffic to it via the service. On a pod with 4/4 containers ready, I can do this:
% kubectl -n datadog exec -it datadog-agent-xgdbx -c agent -- curl agent.datadog.svc.cluster.local.:8126
404 page not found
On a pod where the agent container is failing its readiness check, the command just hangs indefinitely, presumably because Kubernetes is unwilling to route traffic to a pod that is not ready.
At least this is noticeable. When sending UDP metrics traffic to the service, it's likely that the packets are just silently dropped.
@sgnn7, @bberg-indeed's comment implies that Datadog customers who use Kubernetes Service resources to route traffic to their agents, and who haven't restarted those agents, are still losing data as a result of https://status.datadoghq.com/incidents/q2d98y2qv54j.
@deadok22
Please consider fixing the healthCheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.
Fair. I'll wait for the postmortem, but as mentioned in my original comment, we will be looking into this. Overall, though, what should happen to the Agent when a key is detected as "invalid" is a deceptively tricky question, and the possible answers may not have great trade-offs in terms of overall UX.
@sgnn7 If that's the tricky part, can we add flags to allow retry for auto-recovery, or other mechanisms, behind some config (a rough sketch follows below)? That way you can preserve the current behavior but allow customization. After all, it looks like it was an intermittent problem.
As a use case, manually triggered AWS ECS tasks will stay up and blocked if sidecars fail. As seen, the Datadog agent recovered and kept sending data to the server, but the agent health command still reported a failure. For more context: manual ECS task triggers can be part of an automated workflow, so having the agent self-recover would simplify operations and avoid incidents.
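Purely as a sketch of the config-gated retry idea above: the retryOnInvalidKey parameter stands in for a hypothetical setting (not a real datadog-agent option), and the function names are invented for this example:

```go
package example

import "time"

// runHealthCheck chooses between the current behavior (exit on an invalid
// key) and an opt-in retry mode, based on a hypothetical config flag.
func runHealthCheck(retryOnInvalidKey bool, checkKey func() bool, setHealthy func(bool)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ok := checkKey()
		setHealthy(ok)
		if !ok && !retryOnInvalidKey {
			return // preserve today's behavior unless the flag is set
		}
	}
}
```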
Thank you for reporting the issue here, and we appreciate all the contributors for their responses! As mentioned in the previous comments, Datadog engineers detected the issue on Mar 6 and resolved the cause promptly.
If you're interested in learning more details about the issue and its resolution, feel free to request a Root Cause Analysis (RCA) by opening a ticket with our Datadog support (you can reference the status page link https://status.datadoghq.com/incidents/q2d98y2qv54j in the support ticket).
Thank you all for your input, and I will mark this thread as resolved.
Reopening this issue since the Agent changes to retry API key validation failures haven't been implemented yet.
Fixed in #24533