Agent/datadogstatsd doesn't restart after being killed due to OOM

Open petedmarsh opened this issue 6 years ago • 6 comments

I've pieced the following together as best I could but I'm not particularly knowledgeable about system operation/management so please forgive me if I've made a mistake :)

Last night the datadogstatsd and forwarder processes on one of my machines were terminated and did not restart. That machine hit ~100% memory usage overnight (we had a bunch of other problems due to that too).

Looking at the supervisor config for these processes I noticed that autorestart and exitcodes are not explicitly defined:

[program:dogstatsd]
command=/opt/datadog-agent/embedded/bin/python /opt/datadog-agent/agent/dogstatsd.py --use-local-forwarder
stdout_logfile=NONE
stderr_logfile=NONE
startsecs=5
startretries=3
priority=998
user=dd-agent

This means autorestart will default to unexpected, with the exitcodes defaulting to 0,2 (http://supervisord.org/configuration.html)
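To make the defaults above concrete, here is a minimal sketch of the restart decision as the supervisor documentation describes it. The function `should_restart` is a hypothetical name used only for illustration and is not supervisor's own code. It ignores `startsecs` and `startretries`, and assumes the process had already reached the RUNNING state.

```python
from typing import Iterable


def should_restart(autorestart: str, exit_code: int,
                   exitcodes: Iterable[int] = (0, 2)) -> bool:
    """Simplified model of supervisord's autorestart setting.

    Assumption: the default exitcodes are 0,2 as stated in the post;
    a given supervisor version may document a different default.
    """
    if autorestart == "true":
        # Always restart, whatever the exit code was.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    if autorestart == "unexpected":
        # Restart only when the exit code is not one of the expected ones.
        return exit_code not in set(exitcodes)
    raise ValueError(f"unknown autorestart value: {autorestart!r}")
```

Under this model, `should_restart("unexpected", 0)` is false, which matches the behaviour described in this issue, while an exit code of 143 would trigger a restart.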

Looking at the logs for the forwarder process I can see this:

2017-08-08 23:40:37 UTC | INFO | dd.forwarder | forwarder(ddagent.py:571) | caught sigterm. stopping
2017-08-08 23:40:37 UTC | INFO | dd.forwarder | forwarder(ddagent.py:553) | Stopped

And looking at the agent code SIGTERM is handled like so:

    # https://github.com/DataDog/dd-agent/blob/master/ddagent.py#L592
    def sigterm_handler(signum, frame):
        log.info("caught sigterm. stopping")
        app.stop()

# which calls

    # https://github.com/DataDog/dd-agent/blob/master/ddagent.py#L577
    def stop(self):
        self.mloop.stop()

As I understand it, this will cause the process to quit with exit code 0, as no other code is specified, rather than 128 + SIGTERM (143 on Linux). Because the exit code is 0, supervisord doesn't consider it an unexpected shutdown and so does not restart the process.
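The difference between the two exit codes can be reproduced with a small, self-contained sketch. This is not the agent's code: `CHILD_TEMPLATE` and `exit_code_after_sigterm` are hypothetical names, and the child process is only a stand-in for a daemon whose SIGTERM handler stops its main loop. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Stand-in for a daemon: the handler only asks the loop to stop, and the
# exit code is whatever the script passes to sys.exit() afterwards.
CHILD_TEMPLATE = """
import signal, sys, time

stopping = False

def sigterm_handler(signum, frame):
    global stopping
    stopping = True

signal.signal(signal.SIGTERM, sigterm_handler)
print("ready", flush=True)
while not stopping:
    time.sleep(0.05)
sys.exit({exit_expr})
"""


def exit_code_after_sigterm(exit_expr: str) -> int:
    """Start the child, send it SIGTERM, and return its exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_TEMPLATE.format(exit_expr=exit_expr)],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()  # wait until the handler is installed
    child.send_signal(signal.SIGTERM)
    code = child.wait(timeout=10)
    child.stdout.close()
    return code
```

Calling `exit_code_after_sigterm("0")` models a handler that stops cleanly, and `exit_code_after_sigterm("128 + signal.SIGTERM")` models the alternative proposed here. Only the second yields an exit code outside 0,2, so only the second would count as unexpected to a supervisor configured with the defaults discussed above.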

As I said, I'm not super knowledgeable about these things. If the above is true, should the process exit with 128 + SIGTERM as the exit code? And if I'm wrong, would it be reasonable to add autorestart=true to the supervisor config for these processes? As far as I can tell, you always want your datadog processes to restart automatically unless you explicitly kill them.

petedmarsh avatar Aug 09 '17 10:08 petedmarsh

Hey @petedmarsh! Thanks for the bug report!

I'm not sure what we should do offhand. The problem with a sigterm is that it could indicate a reasonable exit scenario in which nothing wrong happened. For instance, if I run it on my command line and ctrl-c, it should exit without error. However, in this case, I can see it being problematic. We're working hard to make the agent more resilient, so this is definitely something to look at.

Can you open up a case with support so that you can send us some more details about your environment? I understand if you don't want to share them here.

You're right though, it should restart and it's definitely a bug if it doesn't! Thanks a lot for your bug report!

gmmeyer avatar Aug 19 '17 19:08 gmmeyer

What about setting autorestart=true in the supervisor config for each process? The processes would then always restart regardless of exit codes - if you wanted to disable a process then you could use supervisor to turn it off.

petedmarsh avatar Aug 19 '17 21:08 petedmarsh

We don't always want it to restart. Sometimes it should fail to start: for example, if you have a bad config file, it shouldn't keep trying to restart.

gmmeyer avatar Aug 20 '17 00:08 gmmeyer

We'll keep you updated, this is an important issue and we'll work on trying to get it resolved. Resiliency is very important to us! 😄

gmmeyer avatar Aug 20 '17 00:08 gmmeyer

Is there any update on this topic for Agent v6?

abeluck avatar Sep 24 '18 06:09 abeluck

I'm curious about updates as well. We have a datadog container running in ECS Fargate that does not restart after being killed due to OOM errors. Has any progress been made on this?

rpdelaney avatar Apr 15 '21 15:04 rpdelaney