aws-greengrass-nucleus icon indicating copy to clipboard operation
aws-greengrass-nucleus copied to clipboard

Greengrass is not shutting down components properly upon shutdown and restarts them on restart

Open cthulhuology opened this issue 3 years ago • 2 comments

After noticing that components were still running after GG failed to shut them down upon shutdown, they were reparented to init. Then when greengrass restarted, duplicate processes were started. After seeing this I created a few test programs and saw the same behaviors.

For example, I have two components d and g.

root@raspberrypi:/home/pi/forkbug# pstree -p
systemd(1)─┬─alsactl(339)
           ├─avahi-daemon(326)───avahi-daemon(340)
           ├─bluetoothd(522)
           ├─cron(328)
           ├─dbus-daemon(325)
           ├─dhcpcd(504)
           ├─hciattach(516)
           ├─login(524)───bash(635)
           ├─rngd(384)─┬─{rngd}(385)
           │           ├─{rngd}(386)
           │           └─{rngd}(387)
           ├─rsyslogd(329)─┬─{rsyslogd}(381)
           │               ├─{rsyslogd}(382)
           │               └─{rsyslogd}(383)
           ├─sshd(525)─┬─sshd(7691)───sshd(7697)───bash(7698)───sudo(9892)───bash(9893)───pstree(10625)
           │           └─sshd(9091)───sshd(9097)───bash(9098)───sudo(10367)───java(10368)─┬─sudo(10462)───sh(10465)───g(10466)─┬─g(10467)───g(10473)
           │                                                                              │                                    ├─g(10468)───g(10475)
           │                                                                              │                                    ├─g(10469)───g(10474)
           │                                                                              │                                    ├─g(10470)───g(10472)
           │                                                                              │                                    └─g(10471)───g(10476)
           │                                                                              ├─sudo(10521)───sh(10525)───d(10526)─┬─d(10527)───d(10532)
           │                                                                              │                                    ├─d(10528)───d(10535)
           │                                                                              │                                    ├─d(10529)───d(10533)
           │                                                                              │                                    ├─d(10530)───d(10534)
           │                                                                              │                                    └─d(10531)───d(10536)
           │                                                                              ├─{java}(10369)
           │                                                                              ├─{java}(10370)
           │                                                                              ├─{java}(10371)
           │                                                                              ├─{java}(10372)
           │                                                                              ├─{java}(10373)
           │                                                                              ├─{java}(10374)
           │                                                                              ├─{java}(10375)
           │                                                                              ├─{java}(10376)
           │                                                                              ├─{java}(10377)
           │                                                                              ├─{java}(10378)
           │                                                                              ├─{java}(10380)
           │                                                                              ├─{java}(10384)
           │                                                                              ├─{java}(10426)
           │                                                                              ├─{java}(10427)
           │                                                                              ├─{java}(10428)
           │                                                                              ├─{java}(10429)
           │                                                                              ├─{java}(10430)
           │                                                                              ├─{java}(10431)
           │                                                                              ├─{java}(10432)
           │                                                                              ├─{java}(10433)
           │                                                                              ├─{java}(10434)
           │                                                                              ├─{java}(10435)
           │                                                                              ├─{java}(10436)
           │                                                                              ├─{java}(10437)
           │                                                                              ├─{java}(10438)
           │                                                                              ├─{java}(10439)
           │                                                                              ├─{java}(10440)
           │                                                                              ├─{java}(10441)
           │                                                                              ├─{java}(10442)
           │                                                                              ├─{java}(10443)
           │                                                                              ├─{java}(10444)
           │                                                                              ├─{java}(10445)
           │                                                                              ├─{java}(10447)
           │                                                                              ├─{java}(10450)
           │                                                                              ├─{java}(10451)
           │                                                                              ├─{java}(10452)
           │                                                                              ├─{java}(10453)
           │                                                                              ├─{java}(10454)
           │                                                                              ├─{java}(10457)
           │                                                                              ├─{java}(10459)
           │                                                                              ├─{java}(10463)
           │                                                                              ├─{java}(10464)
           │                                                                              ├─{java}(10522)
           │                                                                              ├─{java}(10523)
           │                                                                              └─{java}(10524)
           ├─systemd(624)───(sd-pam)(625)
           ├─systemd-journal(123)
           ├─systemd-logind(335)
           ├─systemd-timesyn(286)───{systemd-timesyn}(323)
           ├─systemd-udevd(145)
           ├─thd(354)
           ├─wpa_supplicant(338)
           └─wpa_supplicant(431) 

d has a top level supervisor and spawns 5 workers, restarting any workers that fail. In d each worker that manages a socket, also has a child process that it manages individual requests. The worker spawns a new child for each request. The g component does the same thing, but also creates separate process groups per worker, but has the top level supervisor act as child subreaper via prctl(PR_SET_CHILD_SUBREAPER). so that individual worker groups can be killed, but should a worker die it's children get reaped by the supervisor. In both cases the top level d and g components don't demonsize and send signals to their child processes to terminate.

When greengrass shuts down and attempts to shutdown the processes it enters into an errored state:

2021-04-08T12:22:59.471Z [INFO] (aws.greengrass.Cli-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=d, currentState=STOPPING}
2021-04-08T12:22:59.473Z [INFO] (aws.greengrass.Cli-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutting down process ["/greengrass/packages/artifacts/d/0.0.1/d >/tmp/d.log 2>&1\n"]. {serviceName=d, currentState=STOPPING}
2021-04-08T12:22:59.478Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-report-state. {serviceName=UpdateSystemPolicyService, currentState=STOPPING, newState=FINISHED}
2021-04-08T12:22:59.478Z [INFO] (pool-2-thread-23) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown initiated. {serviceName=g, currentState=STOPPING}
2021-04-08T12:22:59.479Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: Stopping backingTask start. {serviceName=UpdateSystemPolicyService, currentState=STOPPING}
2021-04-08T12:22:59.480Z [INFO] (UpdateSystemPolicyService-lifecycle) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-set-state. {serviceName=UpdateSystemPolicyService, currentState=STOPPING, newState=FINISHED}
2021-04-08T12:22:59.481Z [INFO] (pool-2-thread-23) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutting down process ["/greengrass/packages/artifacts/g/0.0.1/g >/tmp/g.log 2>&1\n"]. {serviceName=g, currentState=STOPPING}
2021-04-08T12:22:59.482Z [WARN] (pool-2-thread-13) com.aws.greengrass.lifecyclemanager.UpdateSystemPolicyService: service-run-interrupted. Service interrupted while running startup. {serviceName=UpdateSystemPolicyService, currentState=FINISHED}
2021-04-08T12:22:59.526Z [INFO] (pool-2-thread-23) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 10632. {}
2021-04-08T12:22:59.527Z [INFO] (aws.greengrass.Cli-lifecycle) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 10634. {}
2021-04-08T12:22:59.593Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-report-state. {serviceName=DeploymentService, currentState=STOPPING, newState=FINISHED}
2021-04-08T12:22:59.595Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: Stopping backingTask start. {serviceName=DeploymentService, currentState=STOPPING}
2021-04-08T12:22:59.596Z [WARN] (pool-2-thread-15) com.aws.greengrass.deployment.DeploymentService: service-run-interrupted. Service interrupted while running startup. {serviceName=DeploymentService, currentState=STOPPING}
2021-04-08T12:22:59.596Z [INFO] (DeploymentService-lifecycle) com.aws.greengrass.deployment.DeploymentService: service-set-state. {serviceName=DeploymentService, currentState=STOPPING, newState=FINISHED}
2021-04-08T12:23:14.471Z [INFO] (aws.greengrass.Cli-lifecycle) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 10634. {}
2021-04-08T12:23:14.471Z [ERROR] (d-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-errored. {reason=Timeout in shutdown, serviceName=d, currentState=STOPPING}
2021-04-08T12:23:14.474Z [INFO] (d-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-report-state. {serviceName=d, currentState=STOPPING, newState=ERRORED}
2021-04-08T12:23:14.476Z [INFO] (d-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=d, currentState=STOPPING, newState=ERRORED}
2021-04-08T12:23:14.479Z [INFO] (pool-2-thread-23) com.aws.greengrass.util.platforms.Platform: Killing child processes of pid 10632. {}
2021-04-08T12:23:14.478Z [ERROR] (g-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-errored. {reason=Timeout in shutdown, serviceName=g, currentState=STOPPING}
2021-04-08T12:23:14.480Z [INFO] (g-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-report-state. {serviceName=g, currentState=STOPPING, newState=ERRORED}
2021-04-08T12:23:14.482Z [INFO] (g-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-set-state. {serviceName=g, currentState=STOPPING, newState=ERRORED}
2021-04-08T12:23:14.492Z [INFO] (Serialized listener processor) com.aws.greengrass.lifecyclemanager.KernelLifecycle: executor-service-shutdown-initiated. {}
2021-04-08T12:23:14.492Z [INFO] (aws.greengrass.Cli-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown completed for process ["/greengrass/packages/artifacts/d/0.0.1/d >/tmp/d.log 2>&1\n"]. {serviceName=d, currentState=ERRORED}
2021-04-08T12:23:14.493Z [INFO] (aws.greengrass.Cli-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=d, currentState=ERRORED}
2021-04-08T12:23:14.492Z [INFO] (pool-2-thread-23) com.aws.greengrass.lifecyclemanager.GenericExternalService: Shutdown completed for process ["/greengrass/packages/artifacts/g/0.0.1/g >/tmp/g.log 2>&1\n"]. {serviceName=g, currentState=ERRORED}
2021-04-08T12:23:14.497Z [INFO] (pool-2-thread-23) com.aws.greengrass.lifecyclemanager.GenericExternalService: generic-service-shutdown. {serviceName=g, currentState=ERRORED}
2021-04-08T12:23:14.495Z [INFO] (Thread-1) com.aws.greengrass.lifecyclemanager.KernelLifecycle: Waiting for executors to shutdown. {}
2021-04-08T12:23:14.495Z [WARN] (FleetStatusService-lifecycle) com.aws.greengrass.status.FleetStatusService: service-state-transition-interrupted. Service lifecycle thread interrupted. Thread will exit now. {serviceName=FleetStatusService, currentState=FINISHED}
2021-04-08T12:23:14.497Z [INFO] (Copier) com.aws.greengrass.lifecyclemanager.GenericExternalService: Run script exited. {exitCode=143, serviceName=d, currentState=ERRORED}
2021-04-08T12:23:14.501Z [INFO] (Copier) d: Run script exited. {exitCode=143, serviceName=d, currentState=ERRORED}
2021-04-08T12:23:14.504Z [INFO] (Thread-1) com.aws.greengrass.lifecyclemanager.KernelLifecycle: executor-service-shutdown-complete. {}
2021-04-08T12:23:14.499Z [INFO] (Copier) com.aws.greengrass.lifecyclemanager.GenericExternalService: Run script exited. {exitCode=143, serviceName=g, currentState=ERRORED}
2021-04-08T12:23:14.510Z [INFO] (Copier) g: Run script exited. {exitCode=143, serviceName=g, currentState=ERRORED}
2021-04-08T12:23:14.529Z [INFO] (Thread-1) com.aws.greengrass.lifecyclemanager.KernelLifecycle: context-shutdown-initiated. {}
2021-04-08T12:23:14.572Z [INFO] (Serialized listener processor) com.aws.greengrass.dependency.Context: Interrupted while running tasks. Publish thread will exit now.. {}
2021-04-08T12:23:14.572Z [INFO] (Thread-1) com.aws.greengrass.lifecyclemanager.KernelLifecycle: context-shutdown-complete. {}

The killProcessAndChildren call successfully terminates the sh wrapper, but because the other worker respawn their children the methodology for reading the pids does not work. And since it tries to terminate each process listed (from a static list) rather than sending a signal to the entire process group, it fails to terminate the components.

systemd(1)─┬─alsactl(339)
           ├─avahi-daemon(326)───avahi-daemon(340)
           ├─bluetoothd(522)
           ├─cron(328)
           ├─d(10647)─┬─d(10652)───d(10657)
           │          ├─d(10653)───d(10658)
           │          ├─d(10654)───d(10659)
           │          ├─d(10655)───d(10660)
           │          └─d(10656)───d(10661)
           ├─dbus-daemon(325)
           ├─dhcpcd(504)
           ├─g(10467)───g(10473)
           ├─g(10468)───g(10475)
           ├─g(10469)───g(10474)
           ├─g(10470)───g(10472)
           ├─g(10471)───g(10476)
           ├─g(10638)─┬─g(10641)───g(10646)
           │          ├─g(10642)───g(10648)
           │          ├─g(10643)───g(10650)
           │          ├─g(10644)───g(10651)
           │          └─g(10645)───g(10649)
           ├─hciattach(516)
           ├─login(524)───bash(635)
           ├─rngd(384)─┬─{rngd}(385)
           │           ├─{rngd}(386)
           │           └─{rngd}(387)
           ├─rsyslogd(329)─┬─{rsyslogd}(381)
           │               ├─{rsyslogd}(382)
           │               └─{rsyslogd}(383)
           ├─sshd(525)─┬─sshd(7691)───sshd(7697)───bash(7698)───sudo(9892)───bash(9893)───pstree(10733)
           │           └─sshd(9091)───sshd(9097)───bash(9098)
           ├─systemd(624)───(sd-pam)(625)
           ├─systemd-journal(123)
           ├─systemd-logind(335)
           ├─systemd-timesyn(286)───{systemd-timesyn}(323)
           ├─systemd-udevd(145)
           ├─thd(354)
           ├─wpa_supplicant(338)
           └─wpa_supplicant(431)

As you can see d and g get reparented to init, and some of the prior processes were killed and new ones were spawned in their place. The net result is they are still mostly running. Then if you startup greengrass again:

     ├─g(10638)─┬─g(10641)───g(10646)
       │          ├─g(10642)───g(10648)
       │          ├─g(10643)───g(10650)
       │          ├─g(10644)───g(10651)
       │          └─g(10645)───g(10649)
       ├─hciattach(516)
       ├─login(524)───bash(635)
       ├─rngd(384)─┬─{rngd}(385)
       │           ├─{rngd}(386)
       │           └─{rngd}(387)
       ├─rsyslogd(329)─┬─{rsyslogd}(381)
       │               ├─{rsyslogd}(382)
       │               └─{rsyslogd}(383)
       ├─sshd(525)─┬─sshd(7691)───sshd(7697)───bash(7698)───sudo(9892)───bash(9893)───pstree(10860)
       │           └─sshd(9091)───sshd(9097)───bash(9098)───sudo(10734)───java(10735)─┬─sudo(10828)───sh(10835)───g(10837)─┬─g(10839)───g(10848)
       │                                                                              │                                    ├─g(10840)───g(10847)
       │                                                                              │                                    ├─g(10841)───g(10845)
       │                                                                              │                                    ├─g(10842)───g(10844)
       │                                                                              │                                    └─g(10843)───g(10846)
       │                                                                              ├─sudo(10829)───sh(10836)───d(10838)─┬─d(10849)───d(10854)
       │                                                                              │                                    ├─d(10850)───d(10855)
       │                                                                              │                                    ├─d(10851)───d(10856)
       │                                                                              │                                    ├─d(10852)───d(10858)
       │                                                                              │                                    └─d(10853)───d(10857)
       │                                                                              ├─{java}(10736)
       │                                                                              ├─{java}(10737)
       │                                                                              ├─{java}(10738)
       │                                                                              ├─{java}(10739)
       │                                                                              ├─{java}(10740)
       │                                                                              ├─{java}(10741)
       │                                                                              ├─{java}(10742)
       │                                                                              ├─{java}(10743)
       │                                                                              ├─{java}(10744)
       │                                                                              ├─{java}(10745)
       │                                                                              ├─{java}(10746)
       │                                                                              ├─{java}(10750)
       │                                                                              ├─{java}(10790)
       │                                                                              ├─{java}(10791)
       │                                                                              ├─{java}(10792)
       │                                                                              ├─{java}(10793)
       │                                                                              ├─{java}(10794)
       │                                                                              ├─{java}(10796)
       │                                                                              ├─{java}(10797)
       │                                                                              ├─{java}(10798)
       │                                                                              ├─{java}(10799)
       │                                                                              ├─{java}(10800)
       │                                                                              ├─{java}(10801)
       │                                                                              ├─{java}(10802)
       │                                                                              ├─{java}(10803)
       │                                                                              ├─{java}(10804)
       │                                                                              ├─{java}(10805)
       │                                                                              ├─{java}(10806)
       │                                                                              ├─{java}(10807)
       │                                                                              ├─{java}(10808)
       │                                                                              ├─{java}(10809)
       │                                                                              ├─{java}(10810)
       │                                                                              ├─{java}(10811)
       │                                                                              ├─{java}(10812)
       │                                                                              ├─{java}(10813)
       │                                                                              ├─{java}(10814)
       │                                                                              ├─{java}(10816)
       │                                                                              ├─{java}(10817)
       │                                                                              ├─{java}(10818)
       │                                                                              ├─{java}(10820)
       │                                                                              ├─{java}(10822)
       │                                                                              ├─{java}(10823)
       │                                                                              ├─{java}(10824)
       │                                                                              ├─{java}(10825)
       │                                                                              ├─{java}(10826)
       │                                                                              ├─{java}(10827)
       │                                                                              ├─{java}(10830)
       │                                                                              ├─{java}(10831)
       │                                                                              ├─{java}(10832)
       │                                                                              ├─{java}(10833)
       │                                                                              └─{java}(10834)
       ├─systemd(624)───(sd-pam)(625)
       ├─systemd-journal(123)
       ├─systemd-logind(335)
       ├─systemd-timesyn(286)───{systemd-timesyn}(323)
       ├─systemd-udevd(145)
       ├─thd(354)
       ├─wpa_supplicant(338)
       └─wpa_supplicant(431)

Now there are multiple new ones running in addition to the originals. This can have very unexpected behaviors, especially when deployments restart greengrass.

To fix this issue, the killProcess code should be rewritten to use the child subreaper functionality of Linux 3.4, and the per-process kill command should be sent to the entire process group for the component (either using killpg, or kill with the negative of the component's lead pid. (cf man 2 kill)

cthulhuology avatar Apr 08 '21 13:04 cthulhuology

Hi Dave, When Greengrass attempts to kill your processes we will find all the PIDs of all children under the shell that Greengrass spawns. Greengrass with then send SIGTERM to them and later SIGKILL.

If you enable debug level logs you will be able to see exactly which PIDs Greengrass found and what kill commands we are using. This will help to know why the components aren't being killed so that we can know if they aren't being identified or they are not stopping in response to the signals.

Additionally, you can increase the shutdown timeout by adding a shutdown section into your recipe lifecycle with an added timeout field.

MikeDombo avatar Apr 08 '21 16:04 MikeDombo

Hi Mike,

Getting the list of PIDs doesn't work for any process which uses a supervisor process to manage child processes. For example, there's a very common pattern for managing incoming socket connections where the parent process accepts and forks so that each request is handled in a child process. These processes are then zombies which get reparented to init and never properly cleaned up until reboot. It would be better for Greengrass to send the SIGTERM and SIGKILL to the parent process's process group so that all of the child processes are delivered the signal by the Linux kernel itself, rather than trying to iterate over the list.

Dave

cthulhuology avatar Apr 20 '21 10:04 cthulhuology

Closing as using the shutdown lifecycle method is the correct way to handle this case. The component should implement whatever command or commands are necessary in order to fully shutdown the component's process(es).

MikeDombo avatar Sep 16 '22 14:09 MikeDombo