failed pause containers result in failed task restarts and orphaned tasks
Nomad version
$ nomad --version
Nomad v1.2.3 (a79efc8422082c4790046c3f5ad92c542592a54f)
Operating system and Environment details
$ uname -a
Linux
Issue
When using the docker driver, if one of the nomad_init_<UUID> containers dies, the other containers in the task group do not get cleaned up.
We discovered this while doing some general testing around "What happens if you kill random containers in random combinations"
I'm aware that a nomad_init_ container failing in production is highly unlikely, but I thought it was worth the report anyway!
Reproduction steps
Create and run a task group as below
$ docker container ls
3 containers, name-<UUID>, connect-proxy-name-<UUID> and nomad_init_<UUID>
$ docker stop $(id of nomad_init_<UUID>)
$ docker container ls
2 containers, name-<UUID>, connect-proxy-name-<UUID>
$ systemctl restart nomad
$ docker container ls
Expected Result
3 containers, name-<UUID>, connect-proxy-name-<UUID> and nomad_init_<UUID>
Where UUID is either the same as before, or a new one
Actual Result
5 containers:

- name-<new_UUID>
- connect-proxy-name-<new_UUID>
- nomad_init_<new_UUID>
- name-<old_UUID>
- connect-proxy-name-<old_UUID>

Nomad has created a new set of containers to replace the task group, but has not cleaned up the existing 2. I would expect it either to re-use them and thus re-create the nomad_init_ container, or to delete them and replace them.
Job file (if appropriate)
group "<redacted>" {
task "<redacted>" {
driver = "docker"
kill_timeout = "20s"
config {
image = "<redacted>"
auth {
username = "<redacted>"
password = "<redacted>"
}
}
env {
<redacted>
}
resources {
cpu = 200
memory = 300
}
}
network {
mode = "bridge"
}
service {
name = "<redacted>"
}
}
The docker image is a local one pulled from a private repo, but I suspect the exact image is irrelevant.

Nomad agent configuration:
data_dir   = "<redacted>"
datacenter = "<redacted>"
bind_addr  = "0.0.0.0"
log_level  = "DEBUG"

server {
  enabled          = true
  bootstrap_expect = 1
  raft_protocol    = 3
  encrypt          = "<redacted>"
}

consul {
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
  token               = "<redacted>"
}

client {
  enabled  = true
  servers  = ["127.0.0.1"]
  cni_path = "/opt/cni/bin"

  template {
    disable_file_sandbox = true
  }

  host_volume "<redacted>" {
    path      = "<redacted>"
    read_only = false
  }

  host_volume "<redacted>" {
    path      = "<redacted>"
    read_only = false
  }
}

plugin "docker" {
  config {
    allow_caps = ["net_admin", "net_raw", "audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot"]
  }
}
Logs
I can provide logs if you're unable to reproduce, but it'll take a while to redact everything, so if you can reproduce it I won't bother.
Hi @RosieBaish! I was able to more-or-less confirm the behavior you're seeing, and as it turns out the behavior on client restart is only a symptom of a different problem.
Let's use the following jobspec, so that we have a fairly simple application with a bridge network that requires the pause container (nomad_init_$uuid), but without the complexity of Connect:
job "example" {
datacenters = ["dc1"]
group "web" {
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "8001", "-h", "/local"]
ports = ["www"]
}
template {
data = "<html>hello, world</html>"
destination = "local/index.html"
}
resources {
cpu = 128
memory = 128
}
}
}
}
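To follow along, assuming the jobspec above is saved as example.nomad, it can be run with:

$ nomad job run example.nomad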
If we run this job and look at the resulting containers we see the main container and the pause container:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60da9b25e9b5 busybox:1 "httpd -v -f -p 8001…" About a minute ago Up About a minute http-8dbf2069-8052-340c-377b-e62d83817dca
88911595dd35 gcr.io/google_containers/pause-amd64:3.1 "/pause" About a minute ago Up About a minute nomad_init_8dbf2069-8052-340c-377b-e62d83817dca
Now let's kill the pause container:
$ docker kill 889
889
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60da9b25e9b5 busybox:1 "httpd -v -f -p 8001…" About a minute ago Up About a minute http-8dbf2069-8052-340c-377b-e62d83817dca
Note that if we wait a few moments, we can see that Nomad isn't recreating the pause container, which at first glance isn't a problem:
$ nomad alloc status 8db
...
Allocation Addresses (mode = "bridge")
Label Dynamic Address
*www yes 10.0.2.15:27877 -> 8001
...
$ curl 10.0.2.15:27877
<html>hello, world</html>
But it is a problem, because if the main container fails there's no longer a pause container that owns the network namespace (it's a quirk of Docker that a particular container needs to "own" the namespace that other containers are joined to, and Nomad inherits that quirk):
$ docker kill 60d
60d
$ nomad alloc status 8db
...
Recent Events:
Time Type Description
2022-03-08T09:57:49-05:00 Killing Sent interrupt. Waiting 5s before force killing
2022-03-08T09:57:49-05:00 Not Restarting Error was unrecoverable
2022-03-08T09:57:49-05:00 Driver Failure Failed to start container 38e4296e7d221f7d4f1dae3ed888e87f967e84ee865efb9954cac036320f411b: API error (409): cannot join network of a non running container: 88911595dd35c2bc2e80f70325ea6485795339c56c70a59f585626016885b85f
2022-03-08T09:57:49-05:00 Restarting Task restarting in 18.16840979s
2022-03-08T09:57:49-05:00 Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-03-08T09:53:31-05:00 Started Task started by client
2022-03-08T09:53:31-05:00 Task Setup Building Task Directory
2022-03-08T09:53:30-05:00 Received Task received by client
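To make the Docker quirk concrete, here's a rough sketch of the behavior at the Docker level (hypothetical container names, not Nomad's actual invocation): the task container is started inside the pause container's network namespace using Docker's container network mode, so once the pause container is gone a restart has no namespace to join.

$ docker run -d --name pause_example gcr.io/google_containers/pause-amd64:3.1
$ docker run -d --name http_example --network container:pause_example busybox:1 httpd -f -p 8001
# If pause_example has been stopped, the second command fails with
# "cannot join network of a non running container", the same error
# shown in the Driver Failure event above.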
Nomad will detect this condition and reschedule the allocation:
$ nomad job status example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
2976d256 23562ac8 web 0 run running 3s ago 2s ago
8dbf2069 003c3bb7 web 0 stop failed 4m52s ago 3s ago
Now let's try to reproduce the client restart behavior you saw. I'll head over to the new alloc and kill the pause container there:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 9 minutes ago Up 9 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
ddfa6cfa5593 gcr.io/google_containers/pause-amd64:3.1 "/pause" 9 minutes ago Up 9 minutes nomad_init_2976d256-4a5d-2062-e639-27a9502c585d
$ docker kill ddf
ddf
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 9 minutes ago Up 9 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
Then I restarted the Nomad client and got the following:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b705ab4232c5 gcr.io/google_containers/pause-amd64:3.1 "/pause" 6 seconds ago Up 5 seconds nomad_init_2976d256-4a5d-2062-e639-27a9502c585d
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 10 minutes ago Up 10 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
But if I wait a moment, the pause container fails, leaving just the old container running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
98b53b629172 busybox:1 "httpd -v -f -p 8001…" 10 minutes ago Up 10 minutes http-2976d256-4a5d-2062-e639-27a9502c585d
What happened? Let's ask the allocation:
$ nomad alloc status 297
...
Recent Events:
Time Type Description
2022-03-08T10:09:09-05:00 Killing Sent interrupt. Waiting 5s before force killing
2022-03-08T10:09:09-05:00 Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed
2022-03-08T09:58:20-05:00 Started Task started by client
2022-03-08T09:58:20-05:00 Task Setup Building Task Directory
2022-03-08T09:58:19-05:00 Received Task received by client
The full error state from the client logs is as follows:
2022-03-08T15:09:06.261Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=nomad_init_2976d256-4a5d-2062-e639-27a9502c585d image_name=gcr.io/google_containers/pause-amd64:3.1 image_id=gcr.io/google_containers/pause-amd64:3.1 attempt=1 error="container already exists"
2022-03-08T15:09:06.286Z [INFO] client.driver_mgr.docker: purged container: driver=docker container_id=ddfa6cfa55932e6e2483883cf721917c66e74316e5f615d883b57df0dae012d7
2022-03-08T15:09:06.748Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed" attempt=1
2022-03-08T15:09:07.981Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): container veth name provided (eth0) already exists" attempt=2
2022-03-08T15:09:09.536Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): container veth name provided (eth0) already exists" attempt=3
2022-03-08T15:09:09.536Z [ERROR] client.alloc_runner: prerun failed: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed"
So there are two issues here:
- Nomad isn't recreating the pause container when it should be. I'm not sure there's a good option here for us because of the way that Docker expects us to create the new container in the namespace of another container, rather than assigning the new container to a pre-existing namespace.
- When Nomad restores the allocation after a restart, it doesn't correctly handle the case where the pause container is gone, and should either recreate the pause container (which is tricky because of the previous point), or it should replace the alloc entirely.
@RosieBaish there's definitely a bug here, but I'm going to need to take it back to the team to figure out a path forward and set aside some time for a fix. I'm going to mark this issue for roadmapping. Thanks so much for the thorough bug report!
I believe we can effectively minimize this failure scenario by configuring the pause container with an auto-restart policy. Technically there will still be a small window during the restart where, if the main task fails, it will not be able to join the network namespace. But if your pause container and main task fail at the same time, most likely something has gone very sideways on the node that Nomad isn't going to be able to handle anyway. The reality is that docker's handling of network namespaces is very poor - if surviving this scenario is important, then docker is not a suitable tool.
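For reference, the auto-restart behavior described above corresponds to Docker's own restart policies; a minimal sketch of the idea (hypothetical container name, not how Nomad actually wires it up) looks like:

$ docker run -d --restart unless-stopped --name pause_example gcr.io/google_containers/pause-amd64:3.1
# With this policy the Docker daemon restarts the pause container itself if it
# dies, so the window where the network namespace has no owner is much smaller.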