failed pause containers result in failed task restarts and orphaned tasks
Nomad version
$ nomad --version
Nomad v1.2.3 (a79efc8422082c4790046c3f5ad92c542592a54f)
Operating system and Environment details
$ uname -a
Linux
Issue
When using the docker driver, if one of the nomad_init_<UUID> containers dies, the other containers in the task group do not get cleaned up.
We discovered this while doing some general testing around "What happens if you kill random containers in random combinations"
I'm aware that a nomad_init_ container failing in production is highly unlikely, but I thought it was worth the report anyway!
Reproduction steps
Create and run a task group as below
$ docker container ls
3 containers, name-<UUID>, connect-proxy-name-<UUID> and nomad_init_<UUID>
$ docker stop $(id of nomad_init_<UUID>)
$ docker container ls
2 containers, name-<UUID>, connect-proxy-name-<UUID>
$ systemctl restart nomad
$ docker container ls
Expected Result
3 containers, name-<UUID>, connect-proxy-name-<UUID> and nomad_init_<UUID>
Where UUID is either the same as before, or a new one
Actual Result
5 containers:

- name-<new_UUID>
- connect-proxy-name-<new_UUID>
- nomad_init_<new_UUID>
- name-<old_UUID>
- connect-proxy-name-<old_UUID>

Nomad has created a new set of containers to replace the task group, but has not cleaned up the existing 2. I would expect it either to re-use them and thus re-create the nomad_init_ container, or to delete them and replace them.
Job file (if appropriate)
group "<redacted>" {
task "<redacted>" {
driver = "docker"
kill_timeout = "20s"
config {
image = "<redacted>"
auth {
username = "<redacted>"
password = "<redacted>"
}
}
env {
<redacted>
}
resources {
cpu = 200
memory = 300
}
}
network {
mode = "bridge"
}
service {
name = "<redacted>"
}
}
The docker image is a local one pulled from a private repo, but I suspect the exact image is irrelevant.

Nomad agent configuration:
data_dir   = "<redacted>"
datacenter = "<redacted>"
bind_addr  = "0.0.0.0"
log_level  = "DEBUG"

server {
  enabled          = true
  bootstrap_expect = 1
  raft_protocol    = 3
  encrypt          = "<redacted>"
}

consul {
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
  token               = "<redacted>"
}

client {
  enabled  = true
  servers  = ["127.0.0.1"]
  cni_path = "/opt/cni/bin"

  template {
    disable_file_sandbox = true
  }

  host_volume "<redacted>" {
    path      = "<redacted>"
    read_only = false
  }

  host_volume "<redacted>" {
    path      = "<redacted>"
    read_only = false
  }
}

plugin "docker" {
  config {
    allow_caps = ["net_admin", "net_raw", "audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot"]
  }
}
Logs
I can provide logs if you're unable to reproduce, but it'll take a while to redact everything, so if you can reproduce it I won't bother.
Hi @RosieBaish! I was able to more-or-less confirm the behavior you're seeing, and as it turns out the behavior on client restart is only a symptom of a different problem.
Let's use the following jobspec, so that we have a fairly simple application with a bridge network that requires the pause container (nomad_init_$uuid), but without the complexity of Connect:
job "example" {
datacenters = ["dc1"]
group "web" {
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "8001", "-h", "/local"]
ports = ["www"]
}
template {
data = "<html>hello, world</html>"
destination = "local/index.html"
}
resources {
cpu = 128
memory = 128
}
}
}
}
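To follow along, assuming the jobspec above is saved as example.nomad, it can be run with:

$ nomad job run example.nomad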
If we run this job and look at the resulting containers we see the main container and the pause container:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60da9b25e9b5 busybox:1 "httpd -v -f -p 8001…" About a minute ago Up About a minute http-8dbf2069-8052-340c-377b-e62d83817dca
88911595dd35 gcr.io/google_containers/pause-amd64:3.1 "/pause" About a minute ago Up About a minute nomad_init_8dbf2069-8052-340c-377b-e62d83817dca
Now let's kill the pause container:
$ docker kill 889
889
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60da9b25e9b5 busybox:1 "httpd -v -f -p 8001…" About a minute ago Up About a minute http-8dbf2069-8052-340c-377b-e62d83817dca
Note that if we wait a few moments, we can see that Nomad isn't recreating the pause container, which at first glance isn't a problem:
$ nomad alloc status 8db
...
Allocation Addresses (mode = "bridge")
Label Dynamic Address
*www yes 10.0.2.15:27877 -> 8001
...
$ curl 10.0.2.15:27877
<html>hello, world</html>
But it is a problem, because if the main container fails there's no longer a pause container that owns the network namespace (it's a quirk of Docker that a particular container needs to "own" the namespace that other containers are joined to, and Nomad inherits that quirk):
$ docker kill 60d
60d
$ nomad alloc status 8db
...
Recent Events:
Time Type Description
2022-03-08T09:57:49-05:00 Killing Sent interrupt. Waiting 5s before force killing
2022-03-08T09:57:49-05:00 Not Restarting Error was unrecoverable
2022-03-08T09:57:49-05:00 Driver Failure Failed to start container 38e4296e7d221f7d4f1dae3ed888e87f967e84ee865efb9954cac036320f411b: API error (409): cannot join network of a non running container: 88911595dd35c2bc2e80f70325ea6485795339c56c70a59f585626016885b85f
2022-03-08T09:57:49-05:00 Restarting Task restarting in 18.16840979s
2022-03-08T09:57:49-05:00 Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-03-08T09:53:31-05:00 Started Task started by client
2022-03-08T09:53:31-05:00 Task Setup Building Task Directory
2022-03-08T09:53:30-05:00 Received Task received by client
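To make the Docker quirk concrete, here's a rough sketch of the behavior at the Docker level (hypothetical container names, not Nomad's actual invocation): the task container is started inside the pause container's network namespace using Docker's container network mode, so once the pause container is gone a restart has no namespace to join.

$ docker run -d --name pause_example gcr.io/google_containers/pause-amd64:3.1
$ docker run -d --name http_example --network container:pause_example busybox:1 httpd -f -p 8001
# If pause_example has been stopped, the second command fails with
# "cannot join network of a non running container", the same error
# shown in the Driver Failure event above.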
Nomad will detect this condition and reschedule the allocation:
$ nomad job status example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
2976d256 23562ac8 web 0 run running 3s ago 2s ago
8dbf2069 003c3bb7 web 0 stop failed 4m52s ago 3s ago
Now let's try to reproduce the client restart behavior you saw. I'll head over to the new alloc and kill the pause container there:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 9 minutes ago Up 9 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
ddfa6cfa5593 gcr.io/google_containers/pause-amd64:3.1 "/pause" 9 minutes ago Up 9 minutes nomad_init_2976d256-4a5d-2062-e639-27a9502c585d
$ docker kill ddf
ddf
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 9 minutes ago Up 9 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
Then I restarted the Nomad client and got the following:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b705ab4232c5 gcr.io/google_containers/pause-amd64:3.1 "/pause" 6 seconds ago Up 5 seconds nomad_init_2976d256-4a5d-2062-e639-27a9502c585d
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 10 minutes ago Up 10 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
But if I wait a moment, the pause container fails, leaving just the old container running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 10 minutes ago Up 10 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
What happened? Let's ask the allocation:
$ nomad alloc status 297
...
Recent Events:
Time Type Description
2022-03-08T10:09:09-05:00 Killing Sent interrupt. Waiting 5s before force killing
2022-03-08T10:09:09-05:00 Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed
2022-03-08T09:58:20-05:00 Started Task started by client
2022-03-08T09:58:20-05:00 Task Setup Building Task Directory
2022-03-08T09:58:19-05:00 Received Task received by client
The full error state from the client logs is as follows:
2022-03-08T15:09:06.261Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=nomad_init_2976d256-4a5d-2062-e639-27a9502c585d image_name=gcr.io/google_containers/pause-amd64:3.1 image_id=gcr.io/google_containers/pause-amd64:3.1 attempt=1 error="container already exists"
2022-03-08T15:09:06.286Z [INFO] client.driver_mgr.docker: purged container: driver=docker container_id=ddfa6cfa55932e6e2483883cf721917c66e74316e5f615d883b57df0dae012d7
2022-03-08T15:09:06.748Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed" attempt=1
2022-03-08T15:09:07.981Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): container veth name provided (eth0) already exists" attempt=2
2022-03-08T15:09:09.536Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): container veth name provided (eth0) already exists" attempt=3
2022-03-08T15:09:09.536Z [ERROR] client.alloc_runner: prerun failed: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed"
So there are two issues here:
- Nomad isn't recreating the pause container when it should be. I'm not sure there's a good option here for us because of the way that Docker expects us to create the new container in the namespace of another container, rather than assigning the new container to a pre-existing namespace.
- When Nomad restores the allocation after a restart, it doesn't correctly handle the case where the pause container is gone, and should either recreate the pause container (which is tricky because of the previous point), or it should replace the alloc entirely.
@RosieBaish there's definitely a bug here, but I'm going to need to take it back to the team to figure out a path forward and set aside some time for a fix. I'm going to mark this issue for roadmapping. Thanks so much for the thorough bug report!
I believe we can effectively minimize this failure scenario by configuring the pause container with an auto-restart policy. Technically there will still be a small window during the restart where, if the main task fails, it will not be able to join the network namespace. But if your pause container and main task fail at the same time, most likely something has gone very sideways on the node that Nomad isn't going to be able to handle anyway. The reality is that docker's handling of network namespaces is very poor - if surviving this scenario is important, then docker is not a suitable tool.
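For reference, the auto-restart behavior described above corresponds to Docker's own restart policies; a minimal sketch of the idea (hypothetical container name, not how Nomad actually wires it up) looks like:

$ docker run -d --restart unless-stopped --name pause_example gcr.io/google_containers/pause-amd64:3.1
# With this policy the Docker daemon restarts the pause container itself if it
# dies, so the window where the network namespace has no owner is much smaller.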