
Datadog Agent won't flush when receiving SIGTERM

Open · chenluo1234 opened this issue Jul 25 '19

Describe what happened: When the Datadog Agent receives SIGTERM, it does not flush pending metrics before it exits.

Describe what you expected: When it receives SIGTERM, I would expect the Datadog Agent to flush all currently aggregated metrics before shutting down gracefully.

Steps to reproduce the issue:

Step 1: Start the Datadog Agent.
Step 2: Emit a metric to the Datadog Agent.
Step 3: Send a SIGTERM signal to the Datadog Agent before the 15-second flush interval ends.

The metric emitted at step 2 is simply lost and never becomes available.
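For anyone who wants to reproduce this quickly, here's a minimal sketch of the steps above. It assumes the agent's DogStatsD listener is on the default 127.0.0.1:8125; AGENT_PID is a hypothetical placeholder for the agent's real process ID.

```python
import os
import signal
import socket

AGENT_PID = 12345  # hypothetical: substitute the agent's real PID

# Step 2: emit a gauge using the plain-text DogStatsD datagram format
# ("name:value|g"), assuming the agent listens on the default port.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"example.metric:42|g", ("127.0.0.1", 8125))

# Step 3: terminate the agent before the flush interval elapses;
# the gauge above is never flushed.
os.kill(AGENT_PID, signal.SIGTERM)
```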

Additional environment details (Operating System, Cloud provider, etc):

chenluo1234 commented Jul 25 '19

Hello @chenluo1234,

Thanks for reporting this. We're tracking this feature request in our backlog. This issue will be updated when we start working on this, until then feel free to contribute if you have some time 🙂!

KSerrania commented Aug 05 '19

Please note that this was partially addressed in https://github.com/DataDog/datadog-agent/pull/4129; the fix doesn't flush open DogStatsD time buckets, but it does flush everything else (see the sketch after this list):

  • closed time buckets.
  • check metrics.
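To make the distinction concrete, here's a simplified sketch of time-bucket aggregation in general, not the agent's actual Go implementation; BUCKET_SIZE and ship() are made up for illustration.

```python
import time

BUCKET_SIZE = 10  # hypothetical bucket width in seconds

buckets = {}  # bucket start timestamp -> list of (metric, value) samples

def ship(samples):
    print("flushing", samples)  # stand-in for sending to the backend

def record(metric, value, now=None):
    now = time.time() if now is None else now
    start = int(now // BUCKET_SIZE) * BUCKET_SIZE
    buckets.setdefault(start, []).append((metric, value))

def flush_closed(now=None):
    now = time.time() if now is None else now
    current = int(now // BUCKET_SIZE) * BUCKET_SIZE
    for start in list(buckets):
        if start < current:  # interval fully elapsed: safe to flush
            ship(buckets.pop(start))
    # The bucket for `current` stays open; if the process exits now,
    # those are the samples that get dropped.
```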

truthbk commented Jan 22 '21

the solution doesn't flush open dogstatsd time buckets ...

@truthbk et al - is there any news on whether open buckets will be addressed any time soon? We're seeing that if we emit a DogStatsD metric to a Datadog container sidecar and the main task exits, the metrics don't consistently make it back to the Datadog API.

miketheman commented Aug 03 '21

I've encountered this issue while using https://github.com/DataDog/agent-github-action in our CI. The Action "gracefully" terminates the container running the agent, but the latter does not flush the metrics I sent to it (in my case statsd gauges) before exiting.

The result is that the metrics don't appear in Datadog and the only solution I found so far was to make my CI process sleep for 10 seconds, which is less than ideal 😅 (and a waste of money).

I'm using v7.35.1 of the agent, so that should include #4129, but it doesn't seem to help.

iMacTia commented May 05 '22

This is currently a big problem for us running the agent on Heroku. We're using version 7.38.0. When the agent is shut down (using the stop command), the most recent metrics sent by our app and received by the agent are not forwarded to the Datadog backend unless we wait an additional 30 seconds before stopping the agent.

In our testing, when we attempted to reduce the wait time to 10 or 15 seconds, there were always missing metrics that never reached the Datadog backend. Waiting an additional 30 seconds is an excessive time to wait and hardly reasonable.

I'm also in agreement with the previous comment from @iMacTia that https://github.com/DataDog/datadog-agent/pull/4129 does not appear to help. I would go further and say that the issue is not fixed by https://github.com/DataDog/datadog-agent/pull/4129 (at least in version 7.38.0), because when the agent stops it does not appear to forward received metrics to the Datadog backend.

The only way we can reliably ensure all metrics sent by our app have been received by the Datadog Agent and forwarded to the Datadog backend is to wait 30 seconds after our app terminates before initiating shutdown of the agent.

mmercurio commented Aug 08 '22

@KSerrania any chance of an update on this? The issue was added to a backlog in August 2019; is it still in the backlog? It looks like there's a deprecated label on this issue too.

twe4ked commented Oct 04 '22

Any hope of ever getting this fixed? We're waiting for up to 30 seconds before shutting down and occasionally it's not enough. This is really a tough pill to swallow and not an acceptable solution at all.

mmercurio commented May 19 '23

Just to clarify, I was mistaken in this earlier comment:

we're waiting for up to 30 seconds before shutting down

We have a script that implements two different time intervals in an attempt to work around this issue and ensure all metrics are forwarded during shutdown. The first interval is the time to wait between our app terminating and the Datadog agent terminating. This needs to be more than 10 seconds (e.g., 15 seconds) because the flush interval used by the agent is 10 seconds.

We also have a second timeout interval that gives the agent up to 30 seconds to shut down cleanly. I mistakenly incremented the second timeout interval when I meant to increment the time to wait before shutting down the agent.

For anyone else who might be experiencing missing metrics because the Datadog agent doesn't forward metrics during shutdown, try adding a delay between when your app sends its last metric and when the Datadog agent terminates. This delay must be greater than 10 seconds in order to give the agent enough time to flush metrics and forward them to the Datadog backend.
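A rough sketch of that sequencing (the `datadog-agent stop` invocation and the exact intervals are assumptions; adjust them for your platform's install):

```python
import subprocess
import time

POST_APP_DELAY = 15      # interval 1: must exceed the agent's 10 s flush interval
AGENT_STOP_TIMEOUT = 30  # interval 2: how long to let the agent shut down cleanly

# 1) The app has terminated and sent its last metric; outlast the flush interval.
time.sleep(POST_APP_DELAY)

# 2) Ask the agent to stop (hypothetical command; installs differ) and give it
#    up to AGENT_STOP_TIMEOUT seconds to exit.
proc = subprocess.Popen(["datadog-agent", "stop"])
try:
    proc.wait(timeout=AGENT_STOP_TIMEOUT)
except subprocess.TimeoutExpired:
    proc.kill()
```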

I'd really like for this issue to be addressed so we don't need to worry about such matters. The agent should forward all metrics received before terminating. Period. I don't understand why this issue is still not resolved after nearly 5 years.

mmercurio commented May 24 '23

Just got hit with this myself. We have a number of VMs running our application and the agent that spin up and down, and we noticed we were losing the last couple of metrics emitted from the application on exit. After lots of faff, we finally traced it back to this.

Our solution is much like the others': delay shutdown by a few moments. But it's not ideal.

I'm truly begging for this to be prioritised as critical. You would hope that an observability platform would have "don't ever lose data" as one of its core principles. At least emit a warning or message that metrics are being lost, goodness me!

Knifa commented Jun 21 '23

This is still an issue; we had to follow the advice above and delay the agent container's shutdown, otherwise we were missing metrics.

However, we were only running the container to send a handful of business metrics, so we switched to using the Datadog HTTP API instead and avoided the need to run an agent container.
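For anyone considering the same route, here's a minimal sketch of submitting a gauge point directly to the v1 series endpoint with only the Python standard library. The metric name and tags are made up, and DD_API_KEY must be set in the environment.

```python
import json
import os
import time
import urllib.request

payload = {
    "series": [{
        "metric": "ci.build.duration",          # hypothetical metric name
        "points": [[int(time.time()), 42.0]],   # [timestamp, value] pairs
        "type": "gauge",
        "tags": ["env:ci"],
    }]
}
req = urllib.request.Request(
    "https://api.datadoghq.com/api/v1/series",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "DD-API-KEY": os.environ["DD_API_KEY"],
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # expect 202 Accepted
```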

awconstable commented Aug 04 '23

Got bit by this again today, even after waiting 30 seconds before shutting down the agent. Apparently 30 seconds is not always long enough to ensure the agent forwards all metrics.

I'm tempted to look at using the HTTP API as @awconstable suggests above, but I have to imagine we'd run into similar issues with connectivity and ensuring metrics are sent successfully.

mmercurio commented Aug 11 '23

This is still happening and is particularly problematic with cron jobs. Recording metrics too close to a shutdown mostly results in those metrics being dropped and lost forever. The only workaround seems to be moving the operation earlier, which isn't always possible.

Related issue: https://github.com/DataDog/datadog-agent/issues/1547

jaredpetersen commented Apr 23 '24