exec driver leaks executor process after `StartTask` error

Open tantra35 opened this issue 3 years ago • 4 comments

Nomad version

Output from Nomad v1.1.10 (2f08fe230da05e1b179710ebe0e2582249599a4b+CHANGES)

Operating system and Environment details

Ubuntu 20.04

Issue

If a task requests capabilities that the exec driver does not allow, the task fails and the Nomad executor process is leaked.

Reproduction steps

For example, request the net_raw capability, which the exec driver does not allow by default:

job testnetworknamespace
{
	region = "global"
	datacenters = ["test"]

	update
	{
		stagger = "1m"
		min_healthy_time = "1m"
		max_parallel = 1
		health_check="checks"
		healthy_deadline = "3m"
		progress_deadline = "6m"
		auto_revert = true
	}

	group testservicecheck
	{
		restart {
			attempts = 2
			delay    = "15s"
		}

		task testservicecheck
		{
			driver = "exec"
			leader=true

			config
			{
				cap_add = ["net_raw"]

				command = "sleep"
				args = ["6000"]
			}

			logs
			{
				max_files = 3
				max_file_size = 10
			}

			resources
			{
				memory = 300
				cpu = 100
			}
		}
	}
} 

After the allocation is placed on a node, it fails with the following task state (which is the expected behavior):

Recent Events:
Time                       Type            Description
2022-01-28T20:22:47+03:00  Killing         Sent interrupt. Waiting 5s before force killing
2022-01-28T20:22:47+03:00  Not Restarting  Error was unrecoverable
2022-01-28T20:22:47+03:00  Driver Failure  driver does not allow the following capabilities: net_raw
2022-01-28T20:22:45+03:00  Task Setup      Building Task Directory
2022-01-28T20:22:40+03:00  Received        Task received by client

On the client node, the Nomad executor processes are leaked (partial output of ps axuf shown here):

dnsmasq    33659  0.0  0.2  13932  2088 ?        S    19:16   0:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,20326,8,2,e0
root       33756  0.6  5.6 1363452 56400 ?       Ssl  19:16   0:25 /opt/nomad/nomad agent -config=/etc/nomad/
root       34470  0.0  3.0 1287848 30340 ?       Ssl  19:23   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/0d35c0b9-5a61-adca-d070-413a1ee7ede6/testservicecheck/executor.out"
root       34893  0.0  3.0 1287848 30184 ?       Ssl  19:26   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/ca50587b-fa49-e422-2a7e-84f582147343/testservicecheck/executor.out"
root       38194  0.0  2.9 1509044 29924 ?       Ssl  20:05   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/006bd711-10c8-c230-9da1-b4182f826f8a/testservicecheck/executor.out"
root       38460  0.0  3.0 1287848 30892 ?       Ssl  20:07   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/6586763d-2fe8-9a89-a9e0-591d26461739/testservicecheck/executor.out"
root       38764  0.0  3.0 1287848 31008 ?       Ssl  20:09   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/52ccc789-89d0-23b4-d3ac-1408e6254ded/testservicecheck/executor.out"
root       40194  0.0  3.0 1361580 30492 ?       Ssl  20:22   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/c0d99d3f-3d47-dbb7-833c-054a4ef25721/testservicecheck/executor.out"
root       33760  0.2  2.6 175836 27048 ?        Ssl  19:16   0:12 /opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.22

tantra35 avatar Jan 28 '22 17:01 tantra35

Thanks for raising this @tantra35, from a quick look at the information you provided (thanks for all the details!) I suspect we're missing some clean-up in an error code path.
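
For illustration, a missing clean-up in an error path, and one way to close it, might look like the Go sketch below. This is not Nomad's code: `helper`, `launchHelper`, `validateCaps`, `allowedCaps`, and `startTask` are hypothetical names, and the sketch assumes the executor is launched before the capability check runs, which the leaked processes suggest but which has not been confirmed here.

```go
package main

import "fmt"

// helper stands in for the per-task executor process. It is a hypothetical
// type for this sketch, not Nomad's executor API.
type helper struct {
	id      string
	running bool
}

// launchHelper pretends to fork the helper process.
func launchHelper(id string) *helper {
	return &helper{id: id, running: true}
}

// shutdown pretends to terminate the helper process.
func (h *helper) shutdown() {
	h.running = false
}

// allowedCaps is an illustrative allow-list, not the driver's real default.
var allowedCaps = map[string]bool{"chown": true, "kill": true}

// validateCaps rejects any capability that is not in allowedCaps.
func validateCaps(caps []string) error {
	for _, c := range caps {
		if !allowedCaps[c] {
			return fmt.Errorf("driver does not allow the following capabilities: %s", c)
		}
	}
	return nil
}

// startTask launches the helper first and validates afterwards (the assumed
// order that makes a leak possible). The deferred function shuts the helper
// down on every error return, so no early return can leave it running.
func startTask(id string, caps []string) (h *helper, err error) {
	h = launchHelper(id)
	defer func() {
		if err != nil {
			h.shutdown()
		}
	}()

	if err = validateCaps(caps); err != nil {
		return h, err
	}
	return h, nil
}

func main() {
	h, err := startTask("example", []string{"net_raw"})
	fmt.Println("error:", err)
	fmt.Println("helper still running:", h.running)
}
```

Without the deferred shutdown, the early return after `validateCaps` would leave the helper running, which matches the symptom reported above.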

lgfa29 avatar Feb 02 '22 18:02 lgfa29

@lgfa29 could you please tell us whether we can expect a fix soon?

tantra35 avatar Feb 03 '22 22:02 tantra35

We don't have a date for a fix. I placed this into our backlog for further triaging.

lgfa29 avatar Feb 03 '22 23:02 lgfa29

Doing some issue cleanup and wanted to confirm that this is still the case even after some improvements we've made recently to the exec driver's process cleanup. Using the following jobspec:

minimal jobspec
job "example" {
  group "sleep" {
    task "sleep" {

      driver = "exec"
      user   = "ubuntu"

      config {
        command = "sleep"
        args    = ["300"]
        cap_add = ["net_raw"]
      }
    }
  }
}

We get task events like the following (as expected):

Recent Events:
Time                       Type            Description
2024-06-24T14:40:00-04:00  Not Restarting  Error was unrecoverable
2024-06-24T14:40:00-04:00  Driver Failure  driver does not allow the following capabilities: net_raw
2024-06-24T14:40:00-04:00  Task Setup      Building Task Directory
2024-06-24T14:40:00-04:00  Received        Task received by client

But after a couple of restarts we get leaked executor processes as reported above:

$ ps afx
...
   1997 ?        Ssl    0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
   2131 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/91bdfcf2-9972-5985-8cd7-62a5d566e193/sleep/executor.out
   2166 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/7599c82e-831f-7699-33f4-c6ab8da2655f/sleep/executor.out
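
To check a client for leftovers like the ones in this `ps` output, the allocation IDs can be pulled from the executor command lines and compared against allocations that are still running (for example with `nomad alloc status <id>`). The Go sketch below is only a rough helper: `executorAllocIDs` and `allocInPath` are hypothetical names, and it assumes the executor's `LogFile` path always contains `/alloc/<uuid>/`, as it does in the listings in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// allocInPath matches the allocation ID inside an executor's LogFile path,
// e.g. ".../alloc/<uuid>/<task>/executor.out".
var allocInPath = regexp.MustCompile(
	`/alloc/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/`)

// executorAllocIDs scans ps output and returns the allocation ID of every
// line that looks like a "nomad executor" process.
func executorAllocIDs(psOutput string) []string {
	var ids []string
	sc := bufio.NewScanner(strings.NewReader(psOutput))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "nomad executor") {
			continue
		}
		if m := allocInPath.FindStringSubmatch(line); m != nil {
			ids = append(ids, m[1])
		}
	}
	return ids
}

func main() {
	// Sample lines copied from the ps output in this issue.
	sample := `   1997 ?        Ssl    0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
   2131 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/91bdfcf2-9972-5985-8cd7-62a5d566e193/sleep/executor.out
   2166 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/7599c82e-831f-7699-33f4-c6ab8da2655f/sleep/executor.out`
	for _, id := range executorAllocIDs(sample) {
		fmt.Println(id)
	}
}
```

An executor whose allocation is already in a terminal state is a candidate leak. The sketch only extracts IDs and does not query Nomad itself.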

I'm going to re-title this slightly and mark it for roadmapping. I'll also note from a quick look at the code that it almost certainly impacts the java driver and possibly the raw_exec driver as well, but haven't tested that.

tgross avatar Jun 24 '24 18:06 tgross

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Apr 02 '25 02:04 github-actions[bot]