Rogue Processes
Nomad version
Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)
Operating system and Environment details
Debian 9.11
Issue
We have noticed that on some long-running clients there were rogue processes that had become disassociated from the Nomad process. We first discovered this when we had reports of inconsistency in our production application that could only be explained by an old version of an application still running somewhere.
The rogue processes were all from service jobs using the Java driver.
Here is an example with a process tree from one of our Nomad clients:
VM uptime: 59+ days
Nomad agent uptime: Since end of Dec 2019
The Nomad agent was not restarted or killed, which otherwise could have explained the runaway processes.
The rogue processes (5 of them) are at the very bottom of the list:
root 726 2.2 0.1 2324544 102664 ? Ssl 2019 1884:38 /usr/bin/dockerd -H fd:// -H tcp://127.0.0.1:2376
root 834 0.6 0.0 979068 17808 ? Ssl 2019 586:06 \_ docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/contain
root 30890 0.2 0.0 356032 3852 ? Sl Jan11 43:35 | \_ docker-containerd-shim a35d82c4ad8e3f68f06afc4cbaa293041c2491661c51b70ed0d6e53e4985e11a /var/run/docker/libcontainerd/a35d82c4ad8e3f68f06afc4cbaa293041c2491661c51b70e
consul 30909 0.0 0.0 4332 716 ? Ss Jan11 0:00 | | \_ /bin/sh -c /entrypoint.sh
consul 30934 0.0 0.0 21772 3172 ? S Jan11 0:00 | | \_ /bin/bash -x /entrypoint.sh
consul 30981 3.6 6.8 10644212 3671764 ? Sl Jan11 642:56 | | \_ /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -server -Xms4g -Xmx4g -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshol
root 1195 0.0 0.0 413372 3820 ? Sl Jan11 0:41 | \_ docker-containerd-shim 82877e0bd7ecee15d090a1d8f7e7ab255756c4403c2b51591d1612ce0a191a1d /var/run/docker/libcontainerd/82877e0bd7ecee15d090a1d8f7e7ab255756c4403c2b5159
root 1213 0.9 0.1 858964 67304 ? Ssl Jan11 174:22 | | \_ filebeat -e
root 1389 0.0 0.0 415292 3896 ? Sl Jan14 0:08 | \_ docker-containerd-shim 194ff7254722d84d6330b5228695b16788504fb10719ee35ae3f8dd428f0181f /var/run/docker/libcontainerd/194ff7254722d84d6330b5228695b16788504fb10719ee35
mnguyen 1407 0.0 0.0 18052 2964 ? Ss Jan14 0:00 | \_ /bin/bash /opt/bin/entry_point.sh
mnguyen 1452 0.0 0.0 4504 1640 ? S Jan14 0:00 | \_ /bin/sh /usr/bin/xvfb-run -n 99 --server-args=-screen 0 1360x1020x24 -ac +extension RANDR java -jar /opt/selenium/selenium-server-standalone.jar -role node -h
mnguyen 1463 0.0 0.0 250068 42996 ? Sl Jan14 1:07 | \_ Xvfb :99 -screen 0 1360x1020x24 -ac +extension RANDR -nolisten tcp -auth /tmp/xvfb-run.IvXjC8/Xauthority
mnguyen 1478 0.1 0.7 19186364 399060 ? Sl Jan14 24:29 | \_ java -jar /opt/selenium/selenium-server-standalone.jar -role node -hub http://selenium-hub.service.consul:4444/grid/register -nodeConfig /opt/selenium/con
mnguyen 3122 0.0 0.0 106920 7020 ? Sl Jan17 0:00 | \_ /opt/selenium/chromedriver-2.29 --port=23631
root 30871 0.0 0.0 118340 3560 ? Sl Jan11 0:00 \_ /usr/bin/docker-proxy -proto tcp -host-ip 10.201.20.39 -host-port 8983 -container-ip 172.17.0.2 -container-port 8983
root 30883 0.0 0.0 190408 3392 ? Sl Jan11 0:00 \_ /usr/bin/docker-proxy -proto udp -host-ip 10.201.20.39 -host-port 8983 -container-ip 172.17.0.2 -container-port 8983
root 1369 0.0 0.0 108480 3536 ? Sl Jan14 0:00 \_ /usr/bin/docker-proxy -proto tcp -host-ip 10.201.20.39 -host-port 25597 -container-ip 172.17.0.3 -container-port 25597
root 1380 0.0 0.0 190408 3388 ? Sl Jan14 0:00 \_ /usr/bin/docker-proxy -proto udp -host-ip 10.201.20.39 -host-port 25597 -container-ip 172.17.0.3 -container-port 25597
root 732 0.0 0.0 118640 15560 ? Ssl 2019 10:12 /usr/bin/google_osconfig_agent
ntp 787 0.0 0.0 97896 3684 ? Ssl 2019 5:15 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /run/ntp.conf.dhcp -u 106:110
consul 1030 0.8 0.1 182772 61556 ? Ssl 2019 729:04 /usr/local/bin/consul agent -config-dir=/etc/consul
root 12377 2.7 0.1 3922380 97740 ? Ssl 2019 1674:46 /usr/local/bin/nomad agent -config=/etc/nomad/config.hcl
root 9011 0.3 0.0 665396 17472 ? Sl Jan11 63:06 \_ /usr/local/bin/nomad logmon
root 9025 0.2 0.0 1100776 27644 ? Ssl Jan11 42:36 \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/959fd52e-5658-24af-13e6-f39229e54b00/statsite/executor.out","LogLevel":"debug","FSIsolation":true}
nobody 9041 0.1 0.0 207192 2508 ? Ss Jan11 19:45 | \_ /local/statsite -f local/statsite.conf
root 10058 0.3 0.0 664884 17276 ? Sl Jan11 63:50 \_ /usr/local/bin/nomad logmon
root 10085 0.2 0.0 1362920 27676 ? Ssl Jan11 43:28 \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/406d6e65-8241-b879-8b9e-0dc18f06970b/cloud-config/executor.out","LogLevel":"debug","FSIsolation":true}
nobody 10102 0.7 1.4 4705208 755628 ? Ssl Jan11 124:37 | \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx512m -server -jar local/cloud-config-server-1.0.0-2a64a95.jar
root 30366 0.4 0.0 1141212 18164 ? Sl Jan11 78:23 \_ /usr/local/bin/nomad logmon
root 30938 0.1 0.0 581804 18548 ? Sl Jan11 21:03 \_ /usr/local/bin/nomad docker_logger
root 1129 0.3 0.0 729268 17688 ? Sl Jan11 55:08 \_ /usr/local/bin/nomad logmon
root 1241 0.0 0.0 507816 18136 ? Sl Jan11 0:10 \_ /usr/local/bin/nomad docker_logger
root 12209 0.3 0.0 598452 18128 ? Sl Jan11 62:31 \_ /usr/local/bin/nomad logmon
root 12443 0.2 0.0 1305580 28372 ? Ssl Jan11 44:02 \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/e2eacc16-de4b-416d-95c6-b4af7644672d/cobwebs-operations/executor.out","LogLevel":"debug","FSIsolation":true
nobody 12460 0.3 1.9 10270904 1021288 ? Ssl Jan11 57:15 | \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx4096m -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/cobwebs-operations-gateway-7.0.31
root 569 0.3 0.0 665140 18620 ? Sl Jan14 44:25 \_ /usr/local/bin/nomad logmon
root 1435 0.0 0.0 509736 18656 ? Sl Jan14 0:04 \_ /usr/local/bin/nomad docker_logger
root 15625 0.4 0.0 655792 17268 ? Sl Jan21 12:45 \_ /usr/local/bin/nomad logmon
root 15638 0.2 0.0 1231848 28036 ? Ssl Jan21 7:56 \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/f64e8b49-df3d-bad2-fb82-a11bfb549f56/gcsproxy/executor.out","LogLevel":"debug","FSIsolation":true}
nobody 15653 0.0 0.0 481244 14596 ? Ssl Jan21 0:11 | \_ /local/gcsproxy -b 127.0.0.1:38080
root 2300 0.4 0.0 663732 18188 ? Sl Jan22 7:53 \_ /usr/local/bin/nomad logmon
root 28687 0.2 0.0 1051088 27224 ? Ssl 16:51 0:01 \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/921c59b3-d46a-6091-699f-6e36e3a7981d/report-web-9-avatica/executor.out","LogLevel":"debug","FSIsolation":tr
nobody 28703 21.0 1.3 9999776 716224 ? Ssl 16:51 1:54 \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx4g -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar local/report-web-9.2.1-440a00da2-b150.jar
nobody 4928 0.4 4.5 10318440 2417572 ? Ssl 2019 209:13 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx4096m -server -Dnetworkaddress.cache.ttl=30 -Dnewrelic.config.max_stack_trace_lines=200 -javaagent:local/newrelic/newrelic.jar -jar loc
root 24087 0.0 0.0 776188 4524 ? Ssl 2019 7:54 /usr/sbin/collectd
root 25192 0.1 0.1 1093904 56120 ? Ssl 2019 78:59 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /etc/stackdriver/collectd.conf -P /var/run/stackdriver-agent.pid
nobody 7514 0.9 2.3 10324996 1259880 ? Ssl 2019 492:31 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx4096m -server -Dnetworkaddress.cache.ttl=30 -Dnewrelic.config.max_stack_trace_lines=200 -javaagent:local/newrelic/newrelic.jar -jar loc
nobody 28793 3.2 1.8 8024688 982448 ? Ssl Jan11 580:25 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.76.0.jar
nobody 6505 0.0 1.3 6328908 698560 ? Ssl Jan11 15:39 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Dnetworkaddress.cache.ttl=30 -Xmx1024m -jar local/search-service-gateway-1.75.0.jar
nobody 18169 0.2 1.3 7307912 706768 ? Ssl Jan17 19:14 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.76.0.jar
nobody 6299 0.2 1.2 7307912 686040 ? Ssl Jan11 18:58 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.75.0.jar
Reproduction steps
N/A
Job file (if appropriate)
Nomad Client logs (if appropriate)
Nomad Server logs (if appropriate)
Hi @vincenthuynh! Thanks for reporting this!
Nomad supervises `exec` and `java` tasks with an `executor` process (you can see this in the lines for `/usr/local/bin/nomad executor`). It looks like if this executor process dies unexpectedly, its child process -- the task workload -- is reparented to PID 1 and not reaped. You can reproduce this pretty easily by running `kill -9` against the executor.
This looks like a bug. If you were to do the same to the `containerd-shim` process running for a Docker task, you'd see that `containerd` remains running and becomes the parent of a new `containerd-shim` process when Nomad reconciles the state of the world and restarts the allocation. I'll mark this as a bug and make sure it gets into our schedule.
Oh, also noting here that this is reproducible on any recent version of Nomad 0.10.x as well!
Wanted to follow up here with some info. We don't expect the `executor` process to die unexpectedly in normal operation. And if the executor dies, I'd expect the java process's stdout/stderr to close and the process to receive a SIGPIPE on its next write and die (unless it has a custom signal handler).
Would love some info about how the executor dies. Could you dig into the Nomad logs for any indication of how it was killed, e.g. did it panic or was it OOM-killed? Knowing the cause would let us target the remediation. Can you also inspect where its stdout is pointing (e.g. `sudo ls -lha /proc/18169/fd/`)? Does it not emit any logs to stdout/stderr?
This reminds me of https://github.com/hashicorp/nomad/pull/5434 - I'll resurrect the PR and aim to include it to address this bug. Thanks for raising it.
We went through and reaped any runaway processes this morning. I'll keep an eye on it again after the weekend and grab some logs if I find any.
I was able to find a runaway process on one of our other nodes:
$ ps
...
nobody 30815 0.1 1.0 6982276 554908 ? Ssl Jan11 45:51 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx1024m -jar local/email-gateway-1.28.0.jar
$ ls -lha /proc/30815/fd/
lrwx------ 1 nobody nogroup 64 Jan 11 11:01 0 -> /null
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 1 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stdout.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 10 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 11 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 12 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 13 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 14 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 15 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunec.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 16 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunjce_provider.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 17 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/resources.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 18 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/charsets.jar (deleted)
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 2 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stderr.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 3 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/rt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 4 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/jfxrt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 5 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 6 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 7 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jce.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 8 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jsse.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 9 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 91 -> /local/email-gateway-1.28.0.jar (deleted)
I will dump the nomad agent logs (in debug level!) and send it over.
Hi again! We have some more information that might help with debugging the orphaned processes:
Below is a job that uses a static port. It unexpectedly terminated and was then unable to restart on the same client due to a port conflict: the original process was left running on the client despite its parent process having been killed.
Alloc id: 5b3d3abb
Recent Events:
Time Type Description
2020-02-27T11:01:46-05:00 Killing Sent interrupt. Waiting 5s before force killing
2020-02-27T11:01:46-05:00 Not Restarting Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2020-02-27T11:01:46-05:00 Terminated Exit Code: 1
2020-02-27T11:01:25-05:00 Started Task started by client
2020-02-27T11:00:59-05:00 Restarting Task restarting in 18.599244521s
2020-02-27T11:00:59-05:00 Terminated Exit Code: 1
2020-02-27T11:00:30-05:00 Started Task started by client
2020-02-27T11:00:12-05:00 Restarting Task restarting in 18.212077249s
2020-02-27T11:00:12-05:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"
2020-02-23T08:06:08-05:00 Started Task started by client
Relevant client logs:
Feb 27 11:00:11 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:11.398-0500 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=319.024µs
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.160-0500 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: tasklet timed out: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app timeout=3s
Feb 27 11:00:12 nmd-z8g1 consul[1070]: 2020/02/27 11:00:12 [INFO] agent: Synced check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: panic: send on closed channel: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: : alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: goroutine 11935 [running]:: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).handleExecWait(0xc00014c620, 0xc000468ba0, 0xc00010c640): alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:549 +0x11c: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: created by github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).Exec: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:470 +0x2ae: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.635-0500 [ERROR] client.driver_mgr.java: error receiving stream from Stats executor RPC, closing stream: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app error="rpc error: code = Unavailable desc = transport is closing"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.635-0500 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /tmp/plugin732811969: connect: connection refused""
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.635-0500 [DEBUG] client.driver_mgr.java.executor: plugin process exited: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app path=/usr/local/bin/nomad pid=25671 error="exit status 2"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.637-0500 [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app reason="Restart within policy" delay=18.212077249s
Feb 27 11:00:12 nmd-z8g1 consul[1070]: 2020/02/27 11:00:12 [INFO] agent: Deregistered service "_nomad-task-5b3d3abb-4175-c539-7a6d-07d6c2713d3e-timespent-app-timespent-actuator-http"
Feb 27 11:00:12 nmd-z8g1 consul[1070]: 2020/02/27 11:00:12 [INFO] agent: Deregistered check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"
Hi @tgross @notnoop, could someone shed some light on what may be causing this error with the Java driver? We tend to see the rogue processes appear on a client when an allocation has this event message.
In the meantime, we are going to stop using the Java driver until this is resolved, as the unregistered services are draining resources from our clients.
Thanks!
2020-02-27T11:00:12-05:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"
Just had the same issue for an `exec` task running on Nomad v0.12.5.
In case anyone is interested, we are alleviating it with this script until we get a proper fix:
#!/usr/bin/env bash
set -exuo pipefail

while true; do
  # find processes owned by uid 65534 ("nobody") whose parent is PID 1,
  # i.e. task workloads orphaned by a dead Nomad executor
  for pid in $(ps -eo 'ppid= uid= pid=' | grep -E '^\s*1 65534' | awk '{ print $3 }'); do
    echo "killing $pid"
    ps "$pid"
    # try a graceful TERM first, then force-kill if still around
    kill "$pid"
    sleep 30
    kill -9 "$pid" || true
  done
  sleep 60
done
So after a lot of fighting this issue and cleaning up after Nomad, here's some code that helps prevent it when you're using a bash script as your command and have trouble getting all of its child processes to die:
cleanup() {
    # kill all processes whose parent is this process
    pkill -P $$
}
# on each fatal signal: run cleanup, clear the traps, then re-raise the
# signal so the script exits with the correct status
for sig in INT QUIT HUP TERM; do
    trap "
        cleanup
        trap - $sig EXIT
        kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT
credit: https://aweirdimagination.net/2020/06/28/kill-child-jobs-on-script-exit/
I still think this should be handled better by Nomad; the process tree under the Nomad executor should be relatively easy to identify and force-kill.
Think I've seen the same issue on Nomad 1.1.2 with an exec task (on CentOS 7.4). I'll try to keep an eye out and get some more information if it happens again.
We've had this happen for a couple of years; we're currently on Nomad 1.4.2.
It happens fairly consistently with a few (about 5 out of 40) exec tasks that use a `bash` script, or `tini` running a `bash` script, as the command. Using `tini` as an entrypoint was a failed attempt to mitigate this problem before we understood it any better, and even though it did not help, it was left in use. At least two of the affected tasks are always the same.
This only happens when our agent certificates get rotated by Ansible, which replaces the certificate files on disk (before the old ones become invalid) and restarts those Nomad agents whose certificates changed. The restart is done one agent at a time using the Ansible `service` module with `state=restarted`, which I believe does a `systemctl restart`. The Nomad agents run as systemd services, mostly using the systemd unit from the Nomad Deployment Guide.
I have rarely had this happen when doing manual restarts, but for those few tasks it happens almost every time Ansible does the restart. It is still random, though, and not 100% consistent.
Here's a `pstree -p` showing that the executor supervising PID 4199 has died (its `tini`/`python` pair now hangs directly off PID 1):
├─nomad(3762493)─┬─nomad(505783)─┬─{nomad}(505784)
│ │ ├─{nomad}(505785)
│ │ ├─{nomad}(505786)
...
│ ├─nomad(505910)─┬─tini(505936)───.graphql-oidc-a(505943)
│ │ ├─{nomad}(505911)
│ │ ├─{nomad}(505912)
...
│ ├─nomad(505956)─┬─tini(505969)───python(505975)
│ │ ├─{nomad}(505957)
│ │ ├─{nomad}(505958)
...
|-tini(4058)---python(4199)
...
For now I've just cleaned up the current tasks, and I don't have timestamps on hand for the last time this happened (to check the logs), but if there's something I should check that would help, the next time this should happen is in 2-3 weeks.
Running on RHEL 8.7.
The Nomad task for the aforementioned process that most often seems to get its executor killed is:
task "zeoserver" {
  driver = "exec"

  resources {
    cpu        = 100
    memory     = 512
    memory_max = 768
  }

  volume_mount {
    volume      = "var"
    destination = "/var/lib/plone"
    read_only   = false
  }

  env {
    PLONE_ZEOSERVER_ADDRESS = "${NOMAD_ADDR_zeo}"
    PLONE_VAR               = "/var/lib/plone"
  }

  config {
    command = "/usr/local/bin/tini"
    args = [
      "--",
      "/usr/local/bin/sh",
      "-c",
      <<EOH
set -e
if [ ! -e /bin ]; then
  ln -s /usr/local/bin /bin
fi
exec plonectl-zeoserver
EOH
    ]
  }

  artifact {
    source      = "${local.repository}/${var.artifact}/${var.artifact}-${var.version}.tar.gz"
    destination = "/"
  }
}
The `/bin` symlinking is there because we are using an extremely stripped-down chroot on Nomad, since all of our artifacts are created with Nix and are mostly self-contained. As said earlier, the runaway problem also happened before `tini`, when we used `/usr/local/bin/sh` as the `command`.
`plonectl-zeoserver` at the end is itself a wrapper bash script that runs the actual `exec /nix/store/.../bin/python -m plonectl.cli ...` at the end.
Other runaway tasks that have been caught include Nginx, Redis, and some Java apps run in the same fashion.
@zemm give the `-g` flag in tini a try. It propagates any signals to all processes in the same process group, which may prevent these rogue processes from being orphaned and left running.
Ref: https://github.com/krallin/tini#process-group-killing
I just realized you are using `exec`, which behaves differently from running the process in the background with `&`. Try adding the `-g` flag to tini and changing `exec plonectl-zeoserver` to:
plonectl-zeoserver &
wait
This ensures that the plonectl-zeoserver process stays in the same process group as your `sh`, so when `tini` propagates the Nomad signal, it is properly handled!
Anyway, my two cents!
I see in 1.4.2 that Docker orphans its containers. I am not sure whether it is related to a restart of the Nomad client service (not the node itself), but we definitely see this when containers use static ports, and obviously the port is then already in use.
Note for internal HashiCorp people, see NOMAD-146 in for more info from a customer.
For anybody following: we're likely picking this up next week.
Noting what seems to be another report in https://github.com/hashicorp/nomad/issues/19062
Further research
- I was able to verify that when the executor process is killed, the workload process continues running and is reparented to systemd.
- This issue does not occur on `-dev` nodes because they have special logic to completely kill workloads started by the node before exiting. This behavior is intentionally not triggered when Nomad is running in production mode, in order to allow Nomad agent upgrades without disrupting running workloads.
- There appear to be configuration options that can be passed to both Go's os/exec and libcontainer's Run calls.
The `syscall.SysProcAttr` struct contains a `Pdeathsig Signal` field that asks the OS to signal any children spawned by the process when the parent dies, and specifies which signal to send them.
NOTE: There is a significant comment embedded in the struct:
// Pdeathsig, if non-zero, is a signal that the kernel will send to
// the child process when the creating thread dies. Note that the signal
// is sent on thread termination, which may happen before process termination.
// There are more details at https://go.dev/issue/27505.
However, this seems to be the correct way to watchdog the child work using the OS. There is prior art in containerd/go-runc, containerd/containerd (a test), and moby/moby.
os/exec
For os/exec's `Cmd.Start`, the provided `Cmd` has a `SysProcAttr` field that takes a pointer to a `syscall.SysProcAttr` struct.
libcontainer
The `libcontainer/configs.Config` struct contains a `ParentDeathSignal` field that performs the same function. It appears that `(l *LibcontainerExecutor) newLibcontainerConfig` is where we build the Config that is ultimately run.
There are further comments of note in libcontainer's `newParentProcess` function:
https://github.com/opencontainers/runc/blob/ab146f23357d867f3ed9a15c4e15507f78eba3b2/libcontainer/container_linux.go#L512-L514
echoed later in
https://github.com/opencontainers/runc/blob/ab146f23357d867f3ed9a15c4e15507f78eba3b2/libcontainer/container_linux.go#L521-L525
Unfortunately I ran out of time to work further on this issue during this period. Hopefully these additional notes might help the next person who picks up this issue.