iptables entries are not reconciled
When a client is stopped, the tasks on that client are left running. When the client restarts, it goes through a restore process to get handles to all its tasks again. If a task fails or is removed while the client is shut down, the client should be able to garbage collect any of its dangling resources (like alloc dirs) and restart the task. This is not happening with iptables.
Fortunately we "tag" all the iptables rules in one of two ways:
- Placing them in a chain named CNI-xxxx / CNI-DN-xxxx. I don't know what that xxxx is, but it's not the alloc ID, container ID, or network namespace ID.
- Adding a comment in the form /* name: "nomad" id: "<alloc ID>" */.
So if we can figure out the naming for the CNI chains, it should be possible to identify "Nomad owned" rules and clean up any that don't belong to an allocation we know about.
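For example, the tagged rules and the alloc IDs they reference can be pulled straight out of the nat table listing. A rough sketch, assuming the -L listing format shown in the logs below (which may vary by iptables version):
# Show every nat rule carrying the CNI comment Nomad sets (name: "nomad" id: "<alloc ID>")
sudo iptables -t nat -L -n | grep 'name: "nomad"'
# Extract just the alloc IDs referenced by those comments
sudo iptables -t nat -L -n | grep -o 'id: "[0-9a-f-]*"' | awk -F'"' '{print $2}' | sort -u
The per-alloc chain names show up as the jump target on the same commented rules (visible in the dumps below), so they can be tied back to an alloc ID without knowing how the chain suffix is derived.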
Nomad version
Nomad v0.10.0-dev (e2761807a346c5e3afd577b7994cfc788700bb15+CHANGES)
(But probably any recent version.)
Reproduction steps
- Run Nomad under systemd.
- Run our Consul Connect demo job:
  nomad job run ./e2e/connect/input/demo.nomad
- Stop the job:
  nomad job stop countdash
- Observe that the tasks and iptables are cleaned up properly:
  docker ps
  sudo iptables -t nat -L -v -n
- Run the job again:
  nomad job run ./e2e/connect/input/demo.nomad
- Stop the Nomad client with sudo systemctl stop nomad.
- Observe that the tasks and iptables are still in place:
  docker ps
  sudo iptables -t nat -L -v -n
- Remove the tasks:
  docker rm -f $(docker ps -a)
- Restart Nomad:
  sudo systemctl start nomad
- Observe that the tasks are started:
  docker ps
- Stop the job cleanly:
  nomad job stop countdash
- Observe that iptables are left behind:
  sudo iptables -t nat -L -v -n
Logs
iptables after repro steps
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ sudo iptables -t nat -L -v --line-numbers -n
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
1 29 1276 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
2 20 880 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 6 packets, 360 bytes)
num pkts bytes target prot opt in out source destination
1 5 300 DOCKER all -- * * 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL
2 279 16740 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT 6 packets, 360 bytes)
num pkts bytes target prot opt in out source destination
1 348 20568 CNI-HOSTPORT-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd requiring masquerade */
2 0 0 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
3 0 0 CNI-6fcd2f53d5f720ec4eb5f04d all -- * * 172.26.64.102 0.0.0.0/0 /* name: "nomad" id: "3e803d29-4f9d-ad8b-adb6-31456a39db69" */
4 0 0 CNI-06d73cb6cdf7130196e2018a all -- * * 172.26.64.101 0.0.0.0/0 /* name: "nomad" id: "ee25f5d7-dcb9-b336-fe3e-27e365aa5cd0" */
Chain CNI-06d73cb6cdf7130196e2018a (1 references)
num pkts bytes target prot opt in out source destination
1 0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "ee25f5d7-dcb9-b336-fe3e-27e365aa5cd0" */
2 0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "ee25f5d7-dcb9-b336-fe3e-27e365aa5cd0" */
Chain CNI-6fcd2f53d5f720ec4eb5f04d (1 references)
num pkts bytes target prot opt in out source destination
1 0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "3e803d29-4f9d-ad8b-adb6-31456a39db69" */
2 0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "3e803d29-4f9d-ad8b-adb6-31456a39db69" */
Chain CNI-HOSTPORT-DNAT (2 references)
num pkts bytes target prot opt in out source destination
Chain CNI-HOSTPORT-MASQ (1 references)
num pkts bytes target prot opt in out source destination
1 59 3540 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x2000/0x2000
Chain CNI-HOSTPORT-SETMARK (0 references)
num pkts bytes target prot opt in out source destination
1 59 3540 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd masquerade mark */ MARK or 0x2000
Chain DOCKER (2 references)
num pkts bytes target prot opt in out source destination
1 0 0 RETURN all -- docker0 * 0.0.0.0/0 0.0.0.0/0
cc @davemay99 @angrycub as a heads up
Summary of the investigation at this point:
- When the client restarts, the network hook's Prerun fires and tries to recreate the network and set up the iptables via CNI.
- This fails because the netns already exists, as expected. So we tear down the task and start over.
- But in the next pass, when we set up the iptables via CNI, we collide with the iptables left behind.
- To fix this we need the network namespace path (which is used by CNI as part of the handle for go-cni#Network.Remove).
- In the non-Docker case, Nomad controls the path to the netns and derives it from the alloc ID, but in the Docker case (which includes all cases with Connect integration because of the Envoy container), Docker owns that path and derives it from the pause container name. So we can't use a deterministic name as a handle to clean up, and we can't get the path from Docker either, because at the point we need it that container has already been removed.
I've verified the following more common failure modes are handled correctly:
- Tasks recover fully when the client restarts (after a few PRs we landed in the current 0.10.0 release branch)
- There's no resource leak when the client restarts if the containers aren't removed.
- There's no resource leak when the client restarts as part of a node (machine) reboot.
Status:
- We could try to fix this by threading state about the network namespace from the allocation runner back into the state store, similar to how we deal with deployment health state. But this will always be subject to races between client failures and state syncs.
- We already have a PR open for 0.10.x to reconcile and GC Docker containers. Because all the rules we're creating are tagged with the string "nomad" and Nomad's alloc IDs, we can make a similar loop for iptables GC (see the sketch after this list).
- Because I've verified that this leak doesn't happen in the common failure modes of a client or node reboot, we're not going to block the 0.10.0 release on this. We'll work up a PR for an out-of-band reconcile loop for 0.10.x.
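A rough sketch of what the detection side of that reconcile loop could look like on a client, assuming the local agent API is reachable at 127.0.0.1:4646 without ACLs/TLS and that jq is available (both assumptions; the real implementation would live inside the client):
# Alloc IDs the cluster still knows about (conservative: includes stopped-but-not-GC'd allocs)
known=$(curl -s http://127.0.0.1:4646/v1/allocations | jq -r '.[].ID')
# Alloc IDs referenced by Nomad-tagged iptables comments in the nat table
tagged=$(sudo iptables -t nat -L -n | grep -o 'id: "[0-9a-f-]*"' | awk -F'"' '{print $2}' | sort -u)
# Anything tagged in iptables but unknown to Nomad is a candidate for GC
for id in $tagged; do
  echo "$known" | grep -q "$id" || echo "stale iptables rules for alloc $id"
done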
Moving this issue out of the 0.10.0 milestone.
Noting this isn't the same as #7537 - the repro steps here still leak rules, e.g.
after.txt
Chain PREROUTING (policy ACCEPT 12 packets, 5781 bytes)
pkts bytes target prot opt in out source destination
154 74057 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
118 56714 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
118 56714 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT 12 packets, 5781 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 32 packets, 3515 bytes)
pkts bytes target prot opt in out source destination
463 31262 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
264 17528 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
245 16388 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT 32 packets, 3515 bytes)
pkts bytes target prot opt in out source destination
448 40004 CNI-HOSTPORT-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd requiring masquerade */
429 38864 CNI-HOSTPORT-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd requiring masquerade */
10 720 MASQUERADE all -- * docker0 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match src-type LOCAL
0 0 MASQUERADE all -- * !docker0 172.30.254.0/24 0.0.0.0/0
0 0 CNI-11bedb6da1593ed3af43ef13 all -- * * 172.26.65.117 0.0.0.0/0 /* name: "nomad" id: "2c0c32b5-2ea3-471f-378f-a740391bea60" */
0 0 CNI-b5b13e7bdc4638b22a8a6e73 all -- * * 172.26.65.118 0.0.0.0/0 /* name: "nomad" id: "6eb897ef-4de8-9a2c-22cc-5965fe282b19" */
Chain CNI-11bedb6da1593ed3af43ef13 (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "2c0c32b5-2ea3-471f-378f-a740391bea60" */
0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "2c0c32b5-2ea3-471f-378f-a740391bea60" */
Chain CNI-HOSTPORT-DNAT (4 references)
pkts bytes target prot opt in out source destination
Chain CNI-HOSTPORT-MASQ (2 references)
pkts bytes target prot opt in out source destination
19 1140 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x2000/0x2000
0 0 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x2000/0x2000
Chain CNI-HOSTPORT-SETMARK (0 references)
pkts bytes target prot opt in out source destination
19 1140 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd masquerade mark */ MARK or 0x2000
19 1140 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd masquerade mark */ MARK or 0x2000
Chain CNI-b5b13e7bdc4638b22a8a6e73 (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "6eb897ef-4de8-9a2c-22cc-5965fe282b19" */
0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "6eb897ef-4de8-9a2c-22cc-5965fe282b19" */
Chain DOCKER (2 references)
pkts bytes target prot opt in out source destination
We have a similar issue - iptables is a mess after some Nomad restarts.
We're on Nomad 1.0.4.
We have hit this too. Having a few stale iptables rules is fine until new allocations are assigned with ports used by the stale rules.
CNI-DN-b545c28573a241d01dadd tcp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "416a88b0-76fa-e270-6c80-7f102216ca13" */ multiport dports 23674,23642
CNI-DN-b545c28573a241d01dadd udp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "416a88b0-76fa-e270-6c80-7f102216ca13" */ multiport dports 23674,23642
CNI-DN-7833d1be7094c0db0c99d tcp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "aace5b7e-8e97-43d4-5705-06a94f326aee" */ multiport dports 23674,22463,23641,21075
CNI-DN-7833d1be7094c0db0c99d udp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "aace5b7e-8e97-43d4-5705-06a94f326aee" */ multiport dports 23674,22463,23641,21075
In the above example, 416a88b0-76fa-e270-6c80-7f102216ca13 was removed while aace5b7e-8e97-43d4-5705-06a94f326aee is a new allocation. Clients of the service will get an error ("Unable to establish connection to 10.133.67.138:23674") while trying to send requests to the running allocation.
Even though we almost always drain nodes before a manual Nomad restart, sometimes Nomad gets restarted automatically by systemd due to a Consul failure or other reasons, in which case we end up with stale iptables rules. So it'd be great if iptables rule reconciliation could be implemented. 🙏 🤞
I'm seeing a slightly different trigger of this issue, but with the same root cause and end results. We use puppet to manage some iptables rules on the host[0]. When puppet makes a ruleset change, it triggers the new ruleset to be persisted (on EL, that's via iptables-save to /etc/sysconfig/iptables). This saved ruleset includes all the permanent rules we're managing via puppet, but also all the "transient" rules installed by nomad/cni plugins. Therefore the next time the ruleset is loaded (e.g. after host reboot), the iptables chains are pre-filled with stale rules from historic tasks. Currently nothing is cleaning those up, and since nomad is appending new rules, the saved rules are higher up in the NAT chains. This is causing particular pain where our ingress containers which listen on static ports 80, 443 get caught in the cached ruleset, because then after a reboot, the NAT rules redirect the traffic to a blackhole instead of the running ingress container. It also affects tasks that don't use static port numbers, but then the pain is deferred to when the port eventually gets reused and is harder to track down.
This is on nomad 1.1.5.
[0] One of the rules we're managing is -A FORWARD -m conntrack --ctstate INVALID -j DROP, to drop packets which the conntrack module considers invalid and will not apply NAT to. The kernel treats these as unsolicited packets and returns a TCP RST, which tears down the connection between the container and the external service, causing disruption. There seem to be quite a few bug reports about this relating to Docker, Kubernetes, etc., and this rule is the widely accepted workaround.
I've just re-read the upgrade guide (in preparation for 1.2.0), and I think the change in 1.1.0 to append the CNI rules rather than insert them at the top of the chain is what made this issue more noticeable (https://github.com/hashicorp/nomad/pull/10181). Previously, had transient rules been persisted, the next time an alloc was started the new iptables rules would be inserted above the stale ones and thus take precedence. Now they are added below the stale rules, so traffic is matched and blackholed by the stale rules.
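A quick way to see whether a stale rule is shadowing a live one (not specific to any Nomad version):
# Rules are evaluated top to bottom, so a stale CNI-DN-xxxx jump that matches the
# same --dports above the live one wins and blackholes the traffic.
sudo iptables -t nat -L CNI-HOSTPORT-DNAT -n -v --line-numbers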
As noted in #11901, this affects us quite badly right now (though we're using nftables as opposed to iptables, the issue and result is the same). Whenever an unclean stop or agent restart has occurred for a job with static ports, those ports will (silently, no errors, local checks seem to succeed) fail to bind again until reboot or stale rules are manually removed.
While at first glance it looked like this was a regression caused by the priority inversion in #10181 (as noted by @optiz0r), that PR looks concerned only with the NOMAD-ADMIN chain, while in our case the issue is with stale rules blackholing dports under the CNI-HOSTPORT-DNAT chain (or maybe they're indeed the same after CNI does its magic?).
This is becoming a major security and stability issue as we are seeing allocations try to forward from ports that already have rules in iptables, and requests bound for them are getting forwarded based on the stale iptables rule. Is there anything we can do to ensure this gets prioritized? Or can someone share a cleanup script they have been using?
Here is the error log around the time it fails to cleanup the stale allocation:
Logs
containerd[3148]: time="2022-05-24T07:05:43.326040572Z" level=info msg="shim disconnected" id=d13e150a783b3a72482e859901590f7d002b71c43c36ffe2f0d46aecca64e794
containerd[3148]: time="2022-05-24T07:05:43.326107477Z" level=error msg="copy shim log" error="read /proc/self/fd/17: file already closed"
dockerd[3364]: time="2022-05-24T07:05:43.326090953Z" level=info msg="ignoring event" container=d13e150a783b3a72482e859901590f7d002b71c43c36ffe2f0d46aecca64e794 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
consul[3318]: 2022-05-24T07:05:43.342Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.342Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.349Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.349Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.380Z [WARN] agent: Failed to deregister service: service=_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083-sidecar-proxy error="Service "_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>
consul[3318]: 2022-05-24T07:05:43.380Z [WARN] agent: Failed to deregister service: service=_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083-sidecar-proxy error="Service "_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>
kernel: docker0: port 1(vethc6029f5) entered disabled state
kernel: vethb5b69c3: renamed from eth0
kernel: docker0: port 1(vethc6029f5) entered disabled state
kernel: device vethc6029f5 left promiscuous mode
kernel: docker0: port 1(vethc6029f5) entered disabled state
systemd[1]: Stopping Nomad...
Hit this same issue today. Performed some manual iptables clean up on a problem client. Here are my notes in case this helps anyone else:
- It seems likely it was caused by some combination of quick nomad job allocation stop and re-deployment and/or nomad systemd service restarts, possibly before a stopped allocation could be cleaned up.
- The first symptom indicating a problem was a failed Consul health check, where the port that should be listening at the host level, and is properly bridged into the container according to the Nomad UI and the job config, just isn't working.
- Tcpdump on the client machine against the non-working port listener shows a SYN packet sent to a Nomad bridge IP ([S]) and a RESET packet immediately returned ([R.]).
- Looking at the tcpdump output shows that the packet is actually being sent to the wrong Nomad bridge IP, and further inspection of the iptables shows that there are duplicate (or more) rules set up for forwarding the listener port into the Nomad bridge, due to unclean handling of an old allocation.
- On our OS, cleaning up the iptables can be done with a client reboot, but in these occurrences it can also be done by hand.
- General procedure: check iptables (command refs below). There will be iptables entries whose comments associate them with an allocation ID. Check for allocation IDs that no longer exist and/or conflict with the iptables rules of existing allocations. If such allocation IDs are seen, they will also be associated with user-defined iptables chains starting with CNI-[a-f0-9] and CNI-DN-[a-f0-9]. These can all be purged with the example cmds below:
# In this case, old rules from a nomad bridge IP with no active allocation superseded the correct rules to a nomad bridge IP with an active allocation listener.
# Find references to obsolete iptables rules with missing allocation IDs (first cmds contains all info, addnl cmds are a bit more verbose)
iptables-save
iptables -t filter -L -v -n
iptables -t nat -L -v -n
# Delete rules specific to the bad allocation IP from the filter and NAT tables
iptables -t filter -D CNI-FORWARD -d 172.26.66.2/32 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -t filter -D CNI-FORWARD -s 172.26.66.2/32 -j ACCEPT
# Delete references to the user defined chains of the non-existent allocation from the filter and NAT tables
iptables -t nat -D POSTROUTING -s 172.26.66.2/32 -m comment --comment "name: \"nomad\" id: \"8145e50e-b164-693e-2136-8055fde5ad10\"" -j CNI-fe647aec064bf60036a312df
iptables -t nat -D CNI-HOSTPORT-DNAT -p tcp -m comment --comment "dnat name: \"nomad\" id: \"8145e50e-b164-693e-2136-8055fde5ad10\"" -m multiport --dports 8008,5432 -j CNI-DN-fe647aec064bf60036a31
iptables -t nat -D CNI-HOSTPORT-DNAT -p udp -m comment --comment "dnat name: \"nomad\" id: \"8145e50e-b164-693e-2136-8055fde5ad10\"" -m multiport --dports 8008,5432 -j CNI-DN-fe647aec064bf60036a31
# Delete rules from user defined chains of the non-existent allocation
iptables -t nat -F CNI-DN-fe647aec064bf60036a31
iptables -t nat -F CNI-fe647aec064bf60036a312df
# Delete the user defined chains from the non-existent allocation
iptables -t nat -X CNI-DN-fe647aec064bf60036a31
iptables -t nat -X CNI-fe647aec064bf60036a312df
Any update on this issue? Facing it and it's causing very annoying stability issues on a select few hosts.
Please fix this or offer a proper solution. I don't care if we have to run a script to do it, but something that can be automated would be nice. We've positioned our whole infrastructure on Nomad, and this is killing us. We would prefer not to jump ship, but I'm still wondering how this isn't affecting other users.
Affects us as well
Hey folks, we update issues when we're working on them. I can say this is on our roadmap but I can't really give a timeline.
What are we supposed to do in the meantime?
This issue is pretty devastating for the use case at my PoB. Is there a workaround that can be implemented until an official fix comes out? At the moment we have to do full reboots of Nomad and take our whole network offline when we run into it.
@johnalotoski has posted a process above; if you were to run that as a periodic task (or just a cron job) that'd clean up the iptables.
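For reference, a rough sketch of automating that purge for a single stale alloc ID, following the manual steps above. This only touches the nat table (the filter-table CNI-FORWARD rules keyed on the alloc's IP still need the manual treatment), and identifying which alloc IDs are actually stale is still up to you, e.g. by diffing the tagged IDs against the allocations the cluster knows about as sketched earlier. Review carefully before running unattended:
#!/usr/bin/env bash
# Usage: ./purge-stale-cni.sh <stale-alloc-id>
stale="$1"    # e.g. 8145e50e-b164-693e-2136-8055fde5ad10
# Snapshot the nat table once so we can find both the rules and the chains to remove.
snapshot=$(sudo iptables-save -t nat)
# Delete every rule whose CNI comment mentions the stale alloc ID by replaying
# the iptables-save line with -A swapped for -D.
echo "$snapshot" | grep -F "$stale" | grep '^-A' | sed 's/^-A/-D/' |
while read -r rule; do
  eval "sudo iptables -t nat $rule"
done
# Flush and delete the now-unreferenced per-alloc chains (CNI-xxxx / CNI-DN-xxxx),
# deliberately excluding the shared CNI-HOSTPORT-* chains.
for chain in $(echo "$snapshot" | grep -F "$stale" | grep -oE 'CNI-(DN-)?[0-9a-f]+' | sort -u); do
  sudo iptables -t nat -F "$chain"
  sudo iptables -t nat -X "$chain"
done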
It's a little difficult to script around that honestly, since you also have to compare against which allocations exist and which host they are on.
Okay, we're moving on from this; we can't support our org with this.