Docker containers managed by Nomad in bridge network mode are brought back up with broken networks.
Nomad version
Nomad v1.7.4
BuildDate 2024-02-08T14:34:12Z
Revision 29019121564e2ef7f5e2a227af6b959510bcc142
We are also hitting it in v1.7.2.
Operating system and Environment details
root@client-1:~# uname -a
Linux client-1 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@client-1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
We have hit this on multiple machines with slightly different versions, though all are Ubuntu 22.04. These are the details of a completely fresh Digital Ocean instance I used to reproduce the bug.
Issue
We have noticed that when we restart the Docker daemon on our machines, every Nomad job on the client is brought back up with a busted network. To be more specific, it is brought up with no network. For example, before restarting Docker, my test container has the following interfaces:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 82:4c:5d:70:4e:fc brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.26.64.4/20 brd 172.26.79.255 scope global eth0
valid_lft forever preferred_lft forever
and after restarting the daemon, is brought back up with just loopback:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
This happens with every container, including the Nomad init container. Docker restarts the containers (as expected), the veths get recreated (as expected), but the containers now lack any interfaces other than loopback (unexpected).
Things that might be notable: the nomad network changes from <BROADCAST,MULTICAST,UP,LOWER_UP> to <NO-CARRIER,BROADCAST,MULTICAST,UP>, and on machines with systemd-networkd, its logs complain about the veths losing carrier.
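For anyone who wants to check a fleet for this state automatically, here is a minimal sketch that parses `ip addr` output and flags a namespace that has only loopback left. The function names and the regular expression are my own and not part of Nomad or iproute2. The two sample strings are trimmed from the outputs above.

```python
import re

# Matches the first line of each interface stanza in `ip addr` output,
# e.g. "2: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...".
# The optional "@if8" peer suffix is dropped from the captured name.
_IFACE_RE = re.compile(
    r"^\d+:\s+([^:@\s]+)(?:@[^:\s]+)?:\s+<([^>]*)>", re.MULTILINE
)


def parse_interfaces(ip_addr_output: str) -> dict[str, set[str]]:
    """Map each interface name to its set of flags."""
    return {
        name: set(flags.split(",")) if flags else set()
        for name, flags in _IFACE_RE.findall(ip_addr_output)
    }


def is_loopback_only(ip_addr_output: str) -> bool:
    """True when at least one interface exists and all of them are loopback."""
    interfaces = parse_interfaces(ip_addr_output)
    return bool(interfaces) and all(
        "LOOPBACK" in flags for flags in interfaces.values()
    )


# Sample inputs trimmed from the report above.
BEFORE_RESTART = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
2: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    inet 172.26.64.4/20 brd 172.26.79.255 scope global eth0
"""

AFTER_RESTART = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
"""

if __name__ == "__main__":
    print("before:", is_loopback_only(BEFORE_RESTART))
    print("after: ", is_loopback_only(AFTER_RESTART))
```

The same `parse_interfaces` helper can be pointed at the host's output to look for the NO-CARRIER flag on the nomad bridge.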
Reproduction steps
- Spin up a fresh Ubuntu 22.04 server (I used a Digital Ocean droplet for our reproduction but we've noticed this happening across our fleet so I don't think they're doing anything weird).
- Install docker-ce as per their docs (I used Docker's apt repository to install it).
- Install Nomad as per the docs (for the reproduction I specifically used the version of Nomad from HashiCorp's repos).
- Install the base CNI plugins by placing the contents of https://github.com/containernetworking/plugins/releases/download/v1.0.0/cni-plugins-linux-amd64-v1.0.0.tgz into /opt/cni/bin
- systemctl start docker
- systemctl start nomad
- Run literally any job (I've included my job file below but we've seen this happen with many jobs)
- systemctl restart docker
Expected Result
The IP/port combo that the job binds should be curl-able. It is before Docker is restarted.
Actual Result
If you curl the ip/port combo it will complain about having no route to host:
root@client-1:~# curl -v localhost:27846
* Trying 127.0.0.1:27846...
* Trying ::1:27846...
* connect to ::1 port 27846 failed: Connection refused
* connect to 127.0.0.1 port 27846 failed: No route to host
* Failed to connect to localhost port 27846 after 3061 ms: No route to host
* Closing connection 0
curl: (7) Failed to connect to localhost port 27846 after 3061 ms: No route to host
This makes sense, as executing ip addr from within the container will now reveal that the container has lost its bridge network veth.
Job file (if appropriate)
We've noticed this happen with every job but the job file I used for the reproduction is:
job "jess-test-job" {
  type = "service"
  datacenters = ["*"]
  group "http" {
    network {
      mode = "bridge"
      port "http" {
        to = "80"
      }
    }
    task "whoami" {
      driver = "docker"
      config {
        image = "strm/helloworld-http"
        ports = ["http"]
      }
    }
  }
}
The toy instance I used for reproduction has a broken journal so sadly I have no logs from that to provide. If reproduction turns out to be an issue I'd be happy to send over some logs from one of our actual failing instances but I have a hunch this won't be that hard to reproduce.
Ah, I failed to mention that the reproduction was done with the default configuration that ships with Nomad so I don't think it's something weird in there breaking things.
I have this issue too. It seems to be caused by the Docker/Nomad service being offline for less than heartbeat_grace, so Nomad doesn't consider the allocations lost and resumes them, but because Docker was offline the network namespaces are gone.
I worked around it by adding a sleep to the nomad service file which is longer than heartbeat_grace, so allocations are always considered lost and Nomad recreates them, including the network namespaces.
The Nomad cluster I use runs on fast-booting lightweight VMs (they boot in less than 10s), so it nearly always hits this.
...
[Service]
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecStartPre=/bin/sleep 90
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
...
Maybe https://github.com/hashicorp/nomad/pull/19886 would help when merged.
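If you apply the workaround above, it may be worth checking that the configured sleep really is longer than your heartbeat_grace. This is a minimal sketch that only reads the unit file text. The helper names are hypothetical, it only recognises the plain `sleep <seconds>` form used in the excerpt above, and the grace value of 10 seconds in the example is an assumption, so substitute whatever your servers are configured with.

```python
import re

# Matches lines such as "ExecStartPre=/bin/sleep 90".
# Only a bare number of seconds is recognised (no "1m" style suffixes).
_SLEEP_RE = re.compile(
    r"^ExecStartPre=\S*sleep\s+(\d+(?:\.\d+)?)\s*$", re.MULTILINE
)


def startup_delay_seconds(unit_text: str) -> float:
    """Sum every ExecStartPre sleep found in the unit file text."""
    return sum(float(seconds) for seconds in _SLEEP_RE.findall(unit_text))


def delay_exceeds_grace(unit_text: str, heartbeat_grace_seconds: float) -> bool:
    """True when the startup delay is strictly longer than heartbeat_grace."""
    return startup_delay_seconds(unit_text) > heartbeat_grace_seconds


# The [Service] excerpt from the comment above.
UNIT_EXCERPT = """\
[Service]
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecStartPre=/bin/sleep 90
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
"""

if __name__ == "__main__":
    # 10 seconds is an assumed heartbeat_grace, not a value from this thread.
    print(startup_delay_seconds(UNIT_EXCERPT))
    print(delay_exceeds_grace(UNIT_EXCERPT, heartbeat_grace_seconds=10))
```

This only validates the configuration. Whether the delay actually avoids the problem still rests on the explanation above being correct.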
Crosslinking #15086 for visibility.
Hi @Jess3Jane and thanks for raising this issue with a great reproduction. I was able to reproduce this locally and have included details below for future readers. I'll add this to our backlog.
Host networking, Docker processes, and health check endpoint after initial start.
root@uk1-c1:/home/jrasell# ip addr show veth541d761a
17: veth541d761a@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master nomad state UP group default
link/ether ea:07:d7:03:b6:b2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::e807:d7ff:fe03:b6b2/64 scope link
valid_lft forever preferred_lft forever
root@uk1-c1:/home/jrasell# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f8edd356ec13 redis:7 "docker-entrypoint.s…" 4 minutes ago Up 4 minutes redis-1f994fe3-06b6-dbc9-2897-72b429a61820
32d148ee127a gcr.io/google_containers/pause-arm64:3.1 "/pause" 4 minutes ago Up 4 minutes nomad_init_1f994fe3-06b6-dbc9-2897-72b429a61820
root@uk1-c1:/home/jrasell# (printf "PING\r\n";) | nc 192.168.1.121 27080
+PONG
Task events show restart of the Docker processes:
Recent Events:
Time Type Description
2024-02-20T08:36:22Z Started Task started by client
2024-02-20T08:36:04Z Restarting Task restarting in 17.156781522s
2024-02-20T08:36:04Z Terminated Exit Code: 0
2024-02-20T08:31:14Z Started Task started by client
2024-02-20T08:31:14Z Task Setup Building Task Directory
2024-02-20T08:31:14Z Received Task received by client
The health check no longer responds.
root@uk1-c1:/home/jrasell# (printf "PING\r\n";) | nc 192.168.1.121 27080
root@uk1-c1:/home/jrasell#
The Nomad client host machine (I only had this test job running on my cluster) no longer has a virtual interface configured:
ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:5b:f4:27 brd ff:ff:ff:ff:ff:ff
inet 192.168.121.22/24 metric 100 brd 192.168.121.255 scope global dynamic enp0s1
valid_lft 55052sec preferred_lft 55052sec
inet6 fd6b:32d9:3793:3897:5054:ff:fe5b:f427/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 2591912sec preferred_lft 604712sec
inet6 fe80::5054:ff:fe5b:f427/64 scope link
valid_lft forever preferred_lft forever
3: enp0s2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:1f:6b:0c brd ff:ff:ff:ff:ff:ff
inet 192.168.1.121/24 brd 192.168.1.255 scope global enp0s2
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe1f:6b0c/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:6c:60:7c:18 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:6cff:fe60:7c18/64 scope link
valid_lft forever preferred_lft forever
11: nomad: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether e2:48:45:4d:96:6e brd ff:ff:ff:ff:ff:ff
inet 172.26.64.1/20 brd 172.26.79.255 scope global nomad
valid_lft forever preferred_lft forever
inet6 fe80::e048:45ff:fe4d:966e/64 scope link
valid_lft forever preferred_lft forever
Not sure whether this is really related, but I have a similar issue with CNI where port forwarding didn't work after all services were restarted (note: I masked the first two octets of the destination IP address):
| plugin type="portmap" failed (add): unable to setup DNAT: running [/sbin/iptables -t nat -A CNI-DN-231ebe256ae7b6bd9006d -p tcp --dport 8084 -d 127.0.0.1 -j DNAT --to-destination x.y.70.228:8080 --wait]: exit status 4: iptables: Resource temporarily unavailable.
| pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="portmap" failed (add): unable to setup DNAT: running [/sbin/iptables -t nat -A CNI-DN-231ebe256ae7b6bd9006d -p tcp --dport 8084 -d 127.0.0.1 -j DNAT --to-destination x.y.70.228:8080 --wait]: exit status 4: iptables: Resource temporarily unavailable.
| failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="portmap" failed (add): unable to setup DNAT: running [/sbin/iptables -t nat -A CNI-DN-231ebe256ae7b6bd9006d -p tcp --dport 8084 -d 127.0.0.1 -j DNAT --to-destination x.y.70.228:8080 --wait]: exit status 4: iptables: Resource temporarily unavailable.
Seems like a race condition to me. In this case I would expect the job to fail and maybe retry later.
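To illustrate the kind of "fail, then retry later" behaviour I would expect, here is a minimal retry-with-backoff sketch. It is not how Nomad or the CNI portmap plugin behaves today. `retry_transient` and its parameters are hypothetical names, and treating the "Resource temporarily unavailable" message from the log above as the only retryable error is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error text taken from the iptables failure in the log above.
TRANSIENT_MARKER = "Resource temporarily unavailable"


def retry_transient(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying with exponential backoff on transient errors.

    Errors whose message does not contain TRANSIENT_MARKER are re-raised
    immediately; the final attempt's error is always re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except RuntimeError as err:
            is_last = attempt == attempts - 1
            if TRANSIENT_MARKER not in str(err) or is_last:
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_setup() -> str:
        # Stand-in for the network setup step; fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("iptables: Resource temporarily unavailable.")
        return "configured"

    print(retry_transient(flaky_setup, base_delay=0.01))
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting.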
Apologies for closing this, I think GitHub did something silly with automation.
I don't need to restart Docker for this to occur. I'm not sure WHAT is triggering the change, but under bridge networking my allocations are started with just a loopback interface.