exec driver leaks executor process after `StartTask` error
Nomad version
Output from Nomad v1.1.10 (2f08fe230da05e1b179710ebe0e2582249599a4b+CHANGES)
Operating system and Environment details
Ubuntu 20.04
Issue
If a task requests capabilities that the exec driver does not allow, the task fails and the nomad executor process is leaked.
Reproduction steps
For example, request the net_raw capability, which the exec driver does not allow by default:
job "testnetworknamespace" {
  region      = "global"
  datacenters = ["test"]

  update {
    stagger           = "1m"
    min_healthy_time  = "1m"
    max_parallel      = 1
    health_check      = "checks"
    healthy_deadline  = "3m"
    progress_deadline = "6m"
    auto_revert       = true
  }

  group "testservicecheck" {
    restart {
      attempts = 2
      delay    = "15s"
    }

    task "testservicecheck" {
      driver = "exec"
      leader = true

      config {
        cap_add = ["net_raw"]
        command = "sleep"
        args    = ["6000"]
      }

      logs {
        max_files     = 3
        max_file_size = 10
      }

      resources {
        memory = 300
        cpu    = 100
      }
    }
  }
}
After the allocation is placed on a node, it fails with the following task state (which is the expected behavior):
Recent Events:
Time Type Description
2022-01-28T20:22:47+03:00 Killing Sent interrupt. Waiting 5s before force killing
2022-01-28T20:22:47+03:00 Not Restarting Error was unrecoverable
2022-01-28T20:22:47+03:00 Driver Failure driver does not allow the following capabilities: net_raw
2022-01-28T20:22:45+03:00 Task Setup Building Task Directory
2022-01-28T20:22:40+03:00 Received Task received by client
On the client node, the nomad executor processes are leaked (partial output of ps axuf):
dnsmasq 33659 0.0 0.2 13932 2088 ? S 19:16 0:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,20326,8,2,e0
root 33756 0.6 5.6 1363452 56400 ? Ssl 19:16 0:25 /opt/nomad/nomad agent -config=/etc/nomad/
root 34470 0.0 3.0 1287848 30340 ? Ssl 19:23 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/0d35c0b9-5a61-adca-d070-413a1ee7ede6/testservicecheck/executor.out"
root 34893 0.0 3.0 1287848 30184 ? Ssl 19:26 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/ca50587b-fa49-e422-2a7e-84f582147343/testservicecheck/executor.out"
root 38194 0.0 2.9 1509044 29924 ? Ssl 20:05 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/006bd711-10c8-c230-9da1-b4182f826f8a/testservicecheck/executor.out"
root 38460 0.0 3.0 1287848 30892 ? Ssl 20:07 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/6586763d-2fe8-9a89-a9e0-591d26461739/testservicecheck/executor.out"
root 38764 0.0 3.0 1287848 31008 ? Ssl 20:09 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/52ccc789-89d0-23b4-d3ac-1408e6254ded/testservicecheck/executor.out"
root 40194 0.0 3.0 1361580 30492 ? Ssl 20:22 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/c0d99d3f-3d47-dbb7-833c-054a4ef25721/testservicecheck/executor.out"
root 33760 0.2 2.6 175836 27048 ? Ssl 19:16 0:12 /opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.22
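Each executor line in the ps output above embeds the allocation ID in its LogFile path. As a rough aid for spotting leaks, the following Go sketch extracts those IDs from ps output lines. The function name executorAllocIDs and the regular expression are illustrative assumptions, not part of Nomad. Comparing the result against the allocations the client still considers running is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// allocIDPattern matches the allocation ID embedded in the executor's
// LogFile path (assumed layout: .../alloc/<uuid>/<task>/executor.out).
var allocIDPattern = regexp.MustCompile(
	`/alloc/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/`)

// executorAllocIDs returns the allocation IDs of every "nomad executor"
// process found in the given ps output lines. Hypothetical helper.
func executorAllocIDs(psLines []string) []string {
	var ids []string
	for _, line := range psLines {
		// Skip the agent itself and unrelated processes.
		if !strings.Contains(line, "nomad executor ") {
			continue
		}
		if m := allocIDPattern.FindStringSubmatch(line); m != nil {
			ids = append(ids, m[1])
		}
	}
	return ids
}

func main() {
	lines := []string{
		`root 33756 0.6 5.6 1363452 56400 ? Ssl 19:16 0:25 /opt/nomad/nomad agent -config=/etc/nomad/`,
		`root 34470 0.0 3.0 1287848 30340 ? Ssl 19:23 0:00 \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/0d35c0b9-5a61-adca-d070-413a1ee7ede6/testservicecheck/executor.out"`,
	}
	for _, id := range executorAllocIDs(lines) {
		fmt.Println(id)
	}
}
```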
Thanks for raising this, @tantra35. From a quick look at the information you provided (thanks for all the details!), I suspect we're missing some clean-up in an error code path.
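To illustrate the kind of missing clean-up suspected here, below is a minimal Go sketch of the general pattern. It is not Nomad's actual driver code. The names executorHandle, fakeExecutor, validateCaps, and startTask are all hypothetical. The assumption is that the executor subprocess is already running when capability validation fails, so every error return must kill it.

```go
package main

import (
	"fmt"
	"strings"
)

// executorHandle is a hypothetical stand-in for the handle a driver holds
// on its executor subprocess. It is not Nomad's real type.
type executorHandle interface {
	Kill()
}

// fakeExecutor records whether Kill was called, so the example runs
// without spawning a real process.
type fakeExecutor struct{ killed bool }

func (f *fakeExecutor) Kill() { f.killed = true }

// validateCaps rejects requested capabilities that are not in the allow list.
func validateCaps(allowed map[string]bool, requested []string) error {
	var denied []string
	for _, c := range requested {
		if !allowed[c] {
			denied = append(denied, c)
		}
	}
	if len(denied) > 0 {
		return fmt.Errorf("driver does not allow the following capabilities: %s",
			strings.Join(denied, ", "))
	}
	return nil
}

// startTask assumes the executor has already been launched. The deferred
// function kills it on any error return, so no error path can leak it.
func startTask(ex executorHandle, allowed map[string]bool, requested []string) (err error) {
	defer func() {
		if err != nil {
			ex.Kill()
		}
	}()

	if err = validateCaps(allowed, requested); err != nil {
		return fmt.Errorf("failed to start task: %w", err)
	}
	return nil
}

func main() {
	ex := &fakeExecutor{}
	err := startTask(ex, map[string]bool{"chown": true}, []string{"net_raw"})
	fmt.Println("error:", err)
	fmt.Println("executor killed:", ex.killed)
}
```

Using a deferred check on the named return value keeps the teardown in one place, rather than relying on each error branch to remember it.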
@lgfa29, could you please tell us whether we can expect a fix soon?
We don't have a date for a fix. I placed this into our backlog for further triaging.
Doing some issue cleanup and wanted to confirm that this is still the case even after some improvements we've made recently to the exec driver's process cleanup. Using the following jobspec:
minimal jobspec
job "example" {
  group "sleep" {
    task "sleep" {
      driver = "exec"
      user   = "ubuntu"

      config {
        command = "sleep"
        args    = ["300"]
        cap_add = ["net_raw"]
      }
    }
  }
}
We get task events like the following (as expected):
Recent Events:
Time Type Description
2024-06-24T14:40:00-04:00 Not Restarting Error was unrecoverable
2024-06-24T14:40:00-04:00 Driver Failure driver does not allow the following capabilities: net_raw
2024-06-24T14:40:00-04:00 Task Setup Building Task Directory
2024-06-24T14:40:00-04:00 Received Task received by client
But after a couple of restarts we get leaked executor processes as reported above:
$ ps afx
...
1997 ? Ssl 0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
2131 ? Ssl 0:00 \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/91bdfcf2-9972-5985-8cd7-62a5d566e193/sleep/executor.out
2166 ? Ssl 0:00 \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/7599c82e-831f-7699-33f4-c6ab8da2655f/sleep/executor.out
I'm going to re-title this slightly and mark it for roadmapping. I'll also note from a quick look at the code that it almost certainly impacts the java driver and possibly the raw_exec driver as well, but haven't tested that.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.