[BUG] forwarderHealth fails to recover from failures

Open deadok22 opened this issue 11 months ago • 14 comments

forwarderHealth's healthCheckLoop returns when hasValidAPIKey returns false here. Once that happens, the forwarder's health no longer gets updated.
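
To make the failure mode concrete, here is a minimal, hedged sketch in Go (names modeled on the issue description and the log lines below, not the actual datadog-agent implementation) contrasting a loop that returns on a failed key validation with one that keeps polling and can therefore recover:

```go
// Illustrative sketch only: healthCheckLoop* and hasValidAPIKey are modeled on
// the issue description, not copied from the datadog-agent source.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type health struct{ healthy atomic.Bool }

// Buggy shape: a single failed API key validation makes the loop return, so the
// status can never flip back to healthy without restarting the Agent.
func healthCheckLoopBuggy(h *health, hasValidAPIKey func() bool, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if !hasValidAPIKey() {
				h.healthy.Store(false)
				return // bug: the loop exits and health is never re-evaluated
			}
			h.healthy.Store(true)
		}
	}
}

// Recovering shape: still report unhealthy on a failed validation, but keep
// polling so the status recovers once validation succeeds again.
func healthCheckLoopRecovering(h *health, hasValidAPIKey func() bool, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			h.healthy.Store(hasValidAPIKey())
		}
	}
}

func main() {
	h := &health{}
	h.healthy.Store(true)
	stop := make(chan struct{})
	calls := 0
	// Simulated validator: the first check fails (e.g. a transient backend 403),
	// every later check succeeds.
	flaky := func() bool { calls++; return calls > 1 }

	go healthCheckLoopRecovering(h, flaky, 10*time.Millisecond, stop)
	time.Sleep(100 * time.Millisecond)
	close(stop)
	fmt.Println("healthy after transient failure:", h.healthy.Load()) // true: the loop recovered
}
```

With the second shape, a transient 403 from the backend only marks the forwarder unhealthy until the next successful validation, rather than permanently.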

I observed a cluster check runner instance with an "unhealthy" forwarder that was nonetheless able to forward data to Datadog:

$ grep -B3 'Healthcheck failed' logs.txt | head -4
2024-03-06 17:19:33 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:392 in internalProcess) | Successfully posted payload to "https://7-50-3-app.agent.datadoghq.com/api/v2/series"
2024-03-06 17:49:33 UTC | CORE | WARN | (comp/forwarder/defaultforwarder/forwarder_health.go:231 in hasValidAPIKey) | api_key '***************************2f6ef' for domain https://api.datadoghq.com/ is invalid
2024-03-06 17:49:33 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/forwarder_health.go:137 in healthCheckLoop) | No valid api key found, reporting the forwarder as unhealthy.
2024-03-06 17:50:10 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:75 in healthHandler) | Healthcheck failed on: [forwarder]
$ grep -A100000 nhealthy logs.txt | grep forwarder | grep -v failed | head -3
2024-03-06 17:49:33 UTC | CORE | ERROR | (comp/forwarder/defaultforwarder/forwarder_health.go:137 in healthCheckLoop) | No valid api key found, reporting the forwarder as unhealthy.
2024-03-06 18:22:03 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:392 in internalProcess) | Successfully posted payload to "https://7-50-3-app.agent.datadoghq.com/api/v1/check_run"
2024-03-06 19:24:33 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/transaction/transaction.go:392 in internalProcess) | Successfully posted payload to "https://7-50-3-app.agent.datadoghq.com/api/v2/series"

deadok22 avatar Mar 07 '24 01:03 deadok22

I shared a flare from that instance in this support case, too: https://help.datadoghq.com/hc/en-us/requests/1583229

deadok22 avatar Mar 07 '24 01:03 deadok22

Same here. Many clusters in production are experiencing the same issue with datadog-agent.

posquit0 avatar Mar 07 '24 01:03 posquit0

I guess it's worth pointing out that this failure can only occur if the Datadog backend responds with a 403, if I'm reading the code right. That hints at a problem with API key validation in the backend.
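
As a hedged illustration of that reading (an assumed shape, not the actual validation code), the key check would only conclude the key is invalid on an explicit 403 from the backend, treating anything else as transient:

```go
package example

import "net/http"

// apiKeyLooksValid is a hypothetical sketch of the decision described above;
// the real datadog-agent logic may differ.
func apiKeyLooksValid(resp *http.Response, err error) bool {
	if err != nil {
		// Network or other transient failure: don't conclude the key is bad.
		return true
	}
	// Only an explicit 403 Forbidden is treated as "invalid key", which is why
	// hitting this failure mode requires the backend to actively reject the key.
	return resp.StatusCode != http.StatusForbidden
}
```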

deadok22 avatar Mar 07 '24 01:03 deadok22

I can confirm it happens for us as well. A simple restart does the job; too bad the issue doesn't cause the liveness probe to fail.

aokomorowski avatar Mar 07 '24 09:03 aokomorowski

This happened for us as well today. As others have mentioned, restarting the pods with the unready agent containers looks to have sorted things as of about an hour ago.

theplatformer avatar Mar 07 '24 10:03 theplatformer

I can confirm this issue too.

smaugs avatar Mar 07 '24 15:03 smaugs

Same. We got alerts on readiness failing, and were scrambling to understand this yesterday.

pcn avatar Mar 07 '24 21:03 pcn

There was a Datadog incident yesterday. Some of our instances were also impacted, but we only noticed today. https://status.datadoghq.com/incidents/q2d98y2qv54j

voiski avatar Mar 07 '24 22:03 voiski

Hi everyone, thank you for reporting this.

The problem (as @voiski correctly identified) is tied to an already-known, short-lived incident yesterday that could have impacted API key validation for some Agents. If needed, and as others have suggested, restarting the Agent should correct the health reporting. We will be revisiting the Agent's health logic in the future to see what we can do to prevent this from recurring.

I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.

Thanks, Srdjan

sgnn7 avatar Mar 07 '24 22:03 sgnn7

@sgnn7 thanks for the update. I'm glad to hear the backend issue got resolved.

> I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.

Please consider fixing healthCheckLoop so that it doesn't exit, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.

deadok22 avatar Mar 07 '24 23:03 deadok22

When a pod fails the readiness check, Kubernetes is unable to route traffic to it via the service. On a pod with 4/4 containers ready, I can do this:

% kubectl -n datadog exec -it datadog-agent-xgdbx -c agent -- curl agent.datadog.svc.cluster.local.:8126
404 page not found

On a pod where the agent container is failing its readiness check, the command just hangs indefinitely, presumably because Kubernetes is unwilling to route traffic to a pod that is not ready.

At least this is noticeable. When sending UDP metrics traffic to the service, it's likely that the packets are just silently dropped.

bberg-indeed avatar Mar 08 '24 01:03 bberg-indeed

@sgnn7

@bberg-indeed's comment implies that Datadog customers who use Kubernetes Service resources to route traffic to their agents, and who haven't restarted those agents, are still losing data as a result of https://status.datadoghq.com/incidents/q2d98y2qv54j

deadok22 avatar Mar 08 '24 01:03 deadok22

@deadok22

> Please consider fixing healthCheckLoop so that it doesn't exit, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.

Fair. I'll wait for the postmortem, but as mentioned in my original comment, we will be looking into this. Overall, it is a deceptively tricky question what the Agent should do if a key is detected as "invalid", and the available options may not have great trade-offs in terms of overall UX.

sgnn7 avatar Mar 08 '24 21:03 sgnn7

@sgnn7 If this is tricky, could you add a flag to allow retries for auto-recovery, or put other mechanisms behind some config? That way you preserve the current behavior but allow customization. After all, it looks like this was an intermittent problem.

As a use case, manually triggered AWS ECS tasks will stay up but blocked if a sidecar fails. As we saw, the Datadog agent recovered and kept sending data to the server, but the agent health command kept reporting failure. For more context: manual ECS task triggers can be part of an automated workflow, so having the agent self-recover would simplify operations and avoid incidents.

voiski avatar Mar 08 '24 23:03 voiski

Thank you for reporting the issue here, and we appreciate all the contributors for their responses! As mentioned in the previous comments, Datadog engineers detected the issue on Mar 6 and promptly resolved the cause.

If you're interested in learning more details about the issue and its resolution, feel free to request a Root Cause Analysis (RCA) by opening a ticket with our Datadog support (you can reference the status page link https://status.datadoghq.com/incidents/q2d98y2qv54j in the support ticket).

Thank you all for your input, and I will mark this thread as resolved.

sgnn7 avatar Mar 25 '24 20:03 sgnn7

Reopening this issue since the Agent changes to retry API key validation failures haven't been implemented yet.

jszwedko avatar Apr 01 '24 16:04 jszwedko

Fixed in #24533

lukesteensen avatar Apr 23 '24 19:04 lukesteensen