iptables entries are not reconciled
When a client is stopped, the tasks on that client are left running. When the client restarts, it goes through a restore process to get handles to all its tasks again. If a task fails or is removed while the client is shut down, the client should be able to garbage collect any of its dangling resources (like alloc dirs) and restart the task. This is not happening with iptables.
Fortunately we "tag" all the iptables rules in one of two ways:
- Placing them in a chain named CNI-xxxx / CNI-DN-xxxx. I don't know what that xxxx is, but it's not the alloc ID, container ID, or network namespace ID.
- Adding a comment in the form /* name: "nomad" id: "<alloc ID>" */.
So if we can figure out the naming for the CNI chains, it should be possible to identify "Nomad owned" rules and clean up any that don't belong to an allocation we know about.
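For example, the tagged rules and the alloc IDs they reference can be pulled straight out of the nat table listing. A rough sketch, assuming the -L listing format shown in the logs below (which may vary by iptables version):
# Show every nat rule carrying the CNI comment Nomad sets (name: "nomad" id: "<alloc ID>")
sudo iptables -t nat -L -n | grep 'name: "nomad"'
# Extract just the alloc IDs referenced by those comments
sudo iptables -t nat -L -n | grep -o 'id: "[0-9a-f-]*"' | awk -F'"' '{print $2}' | sort -u
The per-alloc chain names show up as the jump target on the same commented rules (visible in the dumps below), so they can be tied back to an alloc ID without knowing how the chain suffix is derived.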
Nomad version
Nomad v0.10.0-dev (e2761807a346c5e3afd577b7994cfc788700bb15+CHANGES)
(But probably any recent version.)
Reproduction steps
- Run Nomad under systemd.
- Run our Consul Connect demo job:
  nomad job run ./e2e/connect/input/demo.nomad
- Stop the job:
  nomad job stop countdash
- Observe that the tasks and iptables are cleaned up properly:
  docker ps
  sudo iptables -t nat -L -v -n
- Run the job again:
  nomad job run ./e2e/connect/input/demo.nomad
- Stop the Nomad client with sudo systemctl stop nomad.
- Observe that the tasks and iptables are still in place:
  docker ps
  sudo iptables -t nat -L -v -n
- Remove the tasks:
  docker rm -f $(docker ps -a)
- Restart Nomad:
  sudo systemctl start nomad
- Observe that the tasks are started:
  docker ps
- Stop the job cleanly:
  nomad job stop countdash
- Observe that iptables are left behind:
  sudo iptables -t nat -L -v -n
Logs
iptables after repro steps
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ sudo iptables -t nat -L -v --line-numbers -n
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
1 29 1276 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
2 20 880 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 6 packets, 360 bytes)
num pkts bytes target prot opt in out source destination
1 5 300 DOCKER all -- * * 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL
2 279 16740 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT 6 packets, 360 bytes)
num pkts bytes target prot opt in out source destination
1 348 20568 CNI-HOSTPORT-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd requiring masquerade */
2 0 0 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
3 0 0 CNI-6fcd2f53d5f720ec4eb5f04d all -- * * 172.26.64.102 0.0.0.0/0 /* name: "nomad" id: "3e803d29-4f9d-ad8b-adb6-31456a39db69" */
4 0 0 CNI-06d73cb6cdf7130196e2018a all -- * * 172.26.64.101 0.0.0.0/0 /* name: "nomad" id: "ee25f5d7-dcb9-b336-fe3e-27e365aa5cd0" */
Chain CNI-06d73cb6cdf7130196e2018a (1 references)
num pkts bytes target prot opt in out source destination
1 0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "ee25f5d7-dcb9-b336-fe3e-27e365aa5cd0" */
2 0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "ee25f5d7-dcb9-b336-fe3e-27e365aa5cd0" */
Chain CNI-6fcd2f53d5f720ec4eb5f04d (1 references)
num pkts bytes target prot opt in out source destination
1 0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "3e803d29-4f9d-ad8b-adb6-31456a39db69" */
2 0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "3e803d29-4f9d-ad8b-adb6-31456a39db69" */
Chain CNI-HOSTPORT-DNAT (2 references)
num pkts bytes target prot opt in out source destination
Chain CNI-HOSTPORT-MASQ (1 references)
num pkts bytes target prot opt in out source destination
1 59 3540 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x2000/0x2000
Chain CNI-HOSTPORT-SETMARK (0 references)
num pkts bytes target prot opt in out source destination
1 59 3540 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd masquerade mark */ MARK or 0x2000
Chain DOCKER (2 references)
num pkts bytes target prot opt in out source destination
1 0 0 RETURN all -- docker0 * 0.0.0.0/0 0.0.0.0/0
cc @davemay99 @angrycub as a heads up
Summary of the investigation at this point:
- When the client restarts, the network hook's Prerun fires and tries to recreate the network and set up the iptables via CNI.
- This fails because the netns already exists, as expected. So we tear down the task and start over.
- But in the next pass, when we set up the iptables via CNI, we collide with the iptables left behind.
- To fix this we need the network namespace path (which is used by CNI as part of the handle for go-cni#Network.Remove).
- In the non-Docker case, Nomad controls the path to the netns and derives it from the alloc ID, but in the Docker case (which includes all cases with Connect integration because of the Envoy container), Docker owns that path and derives it from the pause container name. So we can't use a deterministic name as a handle to clean up, and we can't get the path from Docker either, because at the point we need it that container has already been removed.
I've verified the following more common failure modes are handled correctly:
- Tasks recover fully when the client restarts (after a few PRs we landed in the current 0.10.0 release branch)
- There's no resource leak when the client restarts if the containers aren't removed.
- There's no resource leak when the client restarts as part of a node (machine) reboot.
Status:
- We could try to fix this by threading state about the network namespace from the allocation runner back into the state store, similar to how we deal with deployment health state. But this will always be subject to races between client failures and state syncs.
- We already have a PR open for 0.10.x to reconcile and GC Docker containers. Because all the rules we're creating are tagged with the string "nomad" and Nomad's alloc IDs, we can make a similar loop for iptables GC (see the sketch after this list).
- Because I've verified that this leak doesn't happen in the common failure modes of a client or node reboot, we're not going to block the 0.10.0 release on this. We'll work up a PR for an out-of-band reconcile loop for 0.10.x.
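A rough sketch of what the detection side of that reconcile loop could look like on a client, assuming the local agent API is reachable at 127.0.0.1:4646 without ACLs/TLS and that jq is available (both assumptions; the real implementation would live inside the client):
# Alloc IDs the cluster still knows about (conservative: includes stopped-but-not-GC'd allocs)
known=$(curl -s http://127.0.0.1:4646/v1/allocations | jq -r '.[].ID')
# Alloc IDs referenced by Nomad-tagged iptables comments in the nat table
tagged=$(sudo iptables -t nat -L -n | grep -o 'id: "[0-9a-f-]*"' | awk -F'"' '{print $2}' | sort -u)
# Anything tagged in iptables but unknown to Nomad is a candidate for GC
for id in $tagged; do
  echo "$known" | grep -q "$id" || echo "stale iptables rules for alloc $id"
done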
Moving this issue out of the 0.10.0 milestone.
Noting this isn't the same as #7537 - the repro steps here still leak rules, e.g.
after.txt
Chain PREROUTING (policy ACCEPT 12 packets, 5781 bytes)
pkts bytes target prot opt in out source destination
154 74057 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
118 56714 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
118 56714 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT 12 packets, 5781 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 32 packets, 3515 bytes)
pkts bytes target prot opt in out source destination
463 31262 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
264 17528 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
245 16388 CNI-HOSTPORT-DNAT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT 32 packets, 3515 bytes)
pkts bytes target prot opt in out source destination
448 40004 CNI-HOSTPORT-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd requiring masquerade */
429 38864 CNI-HOSTPORT-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd requiring masquerade */
10 720 MASQUERADE all -- * docker0 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match src-type LOCAL
0 0 MASQUERADE all -- * !docker0 172.30.254.0/24 0.0.0.0/0
0 0 CNI-11bedb6da1593ed3af43ef13 all -- * * 172.26.65.117 0.0.0.0/0 /* name: "nomad" id: "2c0c32b5-2ea3-471f-378f-a740391bea60" */
0 0 CNI-b5b13e7bdc4638b22a8a6e73 all -- * * 172.26.65.118 0.0.0.0/0 /* name: "nomad" id: "6eb897ef-4de8-9a2c-22cc-5965fe282b19" */
Chain CNI-11bedb6da1593ed3af43ef13 (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "2c0c32b5-2ea3-471f-378f-a740391bea60" */
0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "2c0c32b5-2ea3-471f-378f-a740391bea60" */
Chain CNI-HOSTPORT-DNAT (4 references)
pkts bytes target prot opt in out source destination
Chain CNI-HOSTPORT-MASQ (2 references)
pkts bytes target prot opt in out source destination
19 1140 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x2000/0x2000
0 0 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 mark match 0x2000/0x2000
Chain CNI-HOSTPORT-SETMARK (0 references)
pkts bytes target prot opt in out source destination
19 1140 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd masquerade mark */ MARK or 0x2000
19 1140 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* CNI portfwd masquerade mark */ MARK or 0x2000
Chain CNI-b5b13e7bdc4638b22a8a6e73 (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- * * 0.0.0.0/0 172.26.64.0/20 /* name: "nomad" id: "6eb897ef-4de8-9a2c-22cc-5965fe282b19" */
0 0 MASQUERADE all -- * * 0.0.0.0/0 !224.0.0.0/4 /* name: "nomad" id: "6eb897ef-4de8-9a2c-22cc-5965fe282b19" */
Chain DOCKER (2 references)
pkts bytes target prot opt in out source destination
We have a similar issue - iptables is a mess after some Nomad restarts.
We're on Nomad 1.0.4.
We have hit this too. Having a few stale iptables rules is fine until new allocations are assigned with ports used by the stale rules.
CNI-DN-b545c28573a241d01dadd tcp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "416a88b0-76fa-e270-6c80-7f102216ca13" */ multiport dports 23674,23642
CNI-DN-b545c28573a241d01dadd udp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "416a88b0-76fa-e270-6c80-7f102216ca13" */ multiport dports 23674,23642
CNI-DN-7833d1be7094c0db0c99d tcp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "aace5b7e-8e97-43d4-5705-06a94f326aee" */ multiport dports 23674,22463,23641,21075
CNI-DN-7833d1be7094c0db0c99d udp -- 0.0.0.0/0 0.0.0.0/0 /* dnat name: "nomad" id: "aace5b7e-8e97-43d4-5705-06a94f326aee" */ multiport dports 23674,22463,23641,21075
In the above example, 416a88b0-76fa-e270-6c80-7f102216ca13 was removed while aace5b7e-8e97-43d4-5705-06a94f326aee is a new allocation. Clients of the service will get an error ("Unable to establish connection to 10.133.67.138:23674") while trying to send requests to the running allocation.
Even though we almost always drain nodes before a manual Nomad restart, sometimes Nomad gets restarted automatically by systemd due to a Consul failure or other reasons, in which case we end up with stale iptables rules. So it'd be great if iptables rule reconciliation could be implemented. 🙏 🤞
I'm seeing a slightly different trigger of this issue, but with the same root cause and end results. We use puppet to manage some iptables rules on the host[0]. When puppet makes a ruleset change, it triggers the new ruleset to be persisted (on EL, that's via iptables-save to /etc/sysconfig/iptables). This saved ruleset includes all the permanent rules we're managing via puppet, but also all the "transient" rules installed by nomad/cni plugins. Therefore the next time the ruleset is loaded (e.g. after host reboot), the iptables chains are pre-filled with stale rules from historic tasks. Currently nothing is cleaning those up, and since nomad is appending new rules, the saved rules are higher up in the NAT chains. This is causing particular pain where our ingress containers which listen on static ports 80, 443 get caught in the cached ruleset, because then after a reboot, the NAT rules redirect the traffic to a blackhole instead of the running ingress container. It also affects tasks that don't use static port numbers, but then the pain is deferred to when the port eventually gets reused and is harder to track down.
This is on nomad 1.1.5.
[0] One of the rules we're managing is -A FORWARD -m conntrack --ctstate INVALID -j DROP, to drop packets which the conntrack module considers invalid and will not apply NAT to. The kernel treats these as unsolicited packets and returns a TCP RST, which tears down the connection between the container and the external service, causing disruption. There seem to be quite a few bug reports about this relating to Docker, Kubernetes, etc., and this rule is the widely accepted workaround.
I've just re-read the upgrade guide (in preparation for 1.2.0), and I think the change in 1.1.0 to append the CNI rules rather than insert them at the top of the chain is what made this issue more noticeable (https://github.com/hashicorp/nomad/pull/10181). Previously, had transient rules been persisted, the next time an alloc was started the new iptables rules would be inserted above the stale ones and thus take precedence. Now they are added below the stale rules, so traffic is matched and blackholed by the stale rules.
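A quick way to see whether a stale rule is shadowing a live one (not specific to any Nomad version):
# Rules are evaluated top to bottom, so a stale CNI-DN-xxxx jump that matches the
# same --dports above the live one wins and blackholes the traffic.
sudo iptables -t nat -L CNI-HOSTPORT-DNAT -n -v --line-numbers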
As noted in #11901, this affects us quite badly right now (though we're using nftables as opposed to iptables, the issue and result is the same). Whenever an unclean stop or agent restart has occurred for a job with static ports, those ports will (silently, no errors, local checks seem to succeed) fail to bind again until reboot or stale rules are manually removed.
While at first glance it looked like this was a regression caused by the priority inversion in #10181 (as noted by @optiz0r), that PR looks concerned only with the NOMAD-ADMIN chain, while in our case the issue is with stale rules blackholing dports under the CNI-HOSTPORT-DNAT chain (or maybe they're indeed the same after CNI does its magic?).
This is becoming a major security and stability issue as we are seeing allocations try to forward from ports that already have rules in iptables, and requests bound for them are getting forwarded based on the stale iptables rule. Is there anything we can do to ensure this gets prioritized? Or can someone share a cleanup script they have been using?
Here is the error log around the time it fails to cleanup the stale allocation:
Logs
containerd[3148]: time="2022-05-24T07:05:43.326040572Z" level=info msg="shim disconnected" id=d13e150a783b3a72482e859901590f7d002b71c43c36ffe2f0d46aecca64e794
containerd[3148]: time="2022-05-24T07:05:43.326107477Z" level=error msg="copy shim log" error="read /proc/self/fd/17: file already closed"
dockerd[3364]: time="2022-05-24T07:05:43.326090953Z" level=info msg="ignoring event" container=d13e150a783b3a72482e859901590f7d002b71c43c36ffe2f0d46aecca64e794 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
consul[3318]: 2022-05-24T07:05:43.342Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.342Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.349Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.349Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083' not in agent state" index
consul[3318]: 2022-05-24T07:05:43.380Z [WARN] agent: Failed to deregister service: service=_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083-sidecar-proxy error="Service "_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>
consul[3318]: 2022-05-24T07:05:43.380Z [WARN] agent: Failed to deregister service: service=_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>-8083-sidecar-proxy error="Service "_nomad-task-4984d1ec-c7ab-09ad-883f-3c66cc9d1cb7-group-<redacted>
kernel: docker0: port 1(vethc6029f5) entered disabled state
kernel: vethb5b69c3: renamed from eth0
kernel: docker0: port 1(vethc6029f5) entered disabled state
kernel: device vethc6029f5 left promiscuous mode
kernel: docker0: port 1(vethc6029f5) entered disabled state
systemd[1]: Stopping Nomad...
Hit this same issue today. Performed some manual iptables clean up on a problem client. Here are my notes in case this helps anyone else:
- It seems likely it was caused by some combination of quick nomad job allocation stop and re-deployment and/or nomad systemd service restarts, possibly before a stopped allocation could be cleaned up.
- The first symptom indicating a problem was a failed Consul health check, where the port that should be listening at the host level, and is properly bridged into the container according to the Nomad UI and the job config, just isn't working.
- Tcpdump on the client machine against the non-working port listener shows a SYN packet sent to a Nomad bridge IP ([S]) and a RESET packet immediately returned ([R.]).
- Looking at the tcpdump output shows that the packet is actually being sent to the wrong Nomad bridge IP, and further inspection of the iptables shows that there are duplicate (or more) rules set up for forwarding the listener port into the Nomad bridge, due to unclean handling of an old allocation.
- On our OS, cleaning up the iptables can be done with a client reboot, but in these occurrences it can also be done by hand.
- General procedure: check iptables (command refs below). There will be iptables entries whose comments associate them with an allocation ID. Check for allocation IDs that no longer exist and/or conflict with the iptables rules of existing allocations. If such allocation IDs are seen, they will also be associated with user-defined iptables chains starting with CNI-[a-f0-9] and CNI-DN-[a-f0-9]. These can all be purged with the example cmds below:
# In this case, old rules from a nomad bridge IP with no active allocation superseded the correct rules to a nomad bridge IP with an active allocation listener.
# Find references to obsolete iptables rules with missing allocation IDs (first cmds contains all info, addnl cmds are a bit more verbose)
iptables-save
iptables -t filter -L -v -n
iptables -t nat -L -v -n
# Delete rules specific to the bad allocation IP from the filter and NAT tables
iptables -t filter -D CNI-FORWARD -d 172.26.66.2/32 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -t filter -D CNI-FORWARD -s 172.26.66.2/32 -j ACCEPT
# Delete references to the user defined chains of the non-existent allocation from the filter and NAT tables
iptables -t nat -D POSTROUTING -s 172.26.66.2/32 -m comment --comment "name: \"nomad\" id: \"8145e50e-b164-693e-2136-8055fde5ad10\"" -j CNI-fe647aec064bf60036a312df
iptables -t nat -D CNI-HOSTPORT-DNAT -p tcp -m comment --comment "dnat name: \"nomad\" id: \"8145e50e-b164-693e-2136-8055fde5ad10\"" -m multiport --dports 8008,5432 -j CNI-DN-fe647aec064bf60036a31
iptables -t nat -D CNI-HOSTPORT-DNAT -p udp -m comment --comment "dnat name: \"nomad\" id: \"8145e50e-b164-693e-2136-8055fde5ad10\"" -m multiport --dports 8008,5432 -j CNI-DN-fe647aec064bf60036a31
# Delete rules from user defined chains of the non-existent allocation
iptables -t nat -F CNI-DN-fe647aec064bf60036a31
iptables -t nat -F CNI-fe647aec064bf60036a312df
# Delete the user defined chains from the non-existent allocation
iptables -t nat -X CNI-DN-fe647aec064bf60036a31
iptables -t nat -X CNI-fe647aec064bf60036a312df
Any update on this issue? Facing it and it's causing very annoying stability issues on a select few hosts.
Please fix this or offer a proper solution. I don't care if we have to run a script to do it, but something that can be automated would be nice. We've positioned our whole infrastructure on Nomad, and this is killing us. We would prefer not to jump ship, but I'm still wondering how this isn't affecting other users.
Affects us as well
Hey folks, we update issues when we're working on them. I can say this is on our roadmap but I can't really give a timeline.
What are we supposed to do in the meantime?
This issue is pretty devastating for the use case at my PoB. Is there a workaround that can be implemented until an official fix comes out? At the moment we have to do full reboots of Nomad and take our whole network offline when we run into it.
@johnalotoski has posted a process above; if you were to run that as a periodic task (or just a cron job) that'd clean up the iptables.
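For reference, a rough sketch of automating that purge for a single stale alloc ID, following the manual steps above. This only touches the nat table (the filter-table CNI-FORWARD rules keyed on the alloc's IP still need the manual treatment), and identifying which alloc IDs are actually stale is still up to you, e.g. by diffing the tagged IDs against the allocations the cluster knows about as sketched earlier. Review carefully before running unattended:
#!/usr/bin/env bash
# Usage: ./purge-stale-cni.sh <stale-alloc-id>
stale="$1"    # e.g. 8145e50e-b164-693e-2136-8055fde5ad10
# Snapshot the nat table once so we can find both the rules and the chains to remove.
snapshot=$(sudo iptables-save -t nat)
# Delete every rule whose CNI comment mentions the stale alloc ID by replaying
# the iptables-save line with -A swapped for -D.
echo "$snapshot" | grep -F "$stale" | grep '^-A' | sed 's/^-A/-D/' |
while read -r rule; do
  eval "sudo iptables -t nat $rule"
done
# Flush and delete the now-unreferenced per-alloc chains (CNI-xxxx / CNI-DN-xxxx),
# deliberately excluding the shared CNI-HOSTPORT-* chains.
for chain in $(echo "$snapshot" | grep -F "$stale" | grep -oE 'CNI-(DN-)?[0-9a-f]+' | sort -u); do
  sudo iptables -t nat -F "$chain"
  sudo iptables -t nat -X "$chain"
done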
It's a little difficult to script around that honestly, since you also have to compare against which allocations exist and which host they are on.
Okay, we're moving on from this; we can't support our org with this.