
TCP-RESET Packets are not masqueraded

Open goekhanm opened this issue 9 years ago • 16 comments

Description of problem: A few outgoing Docker TCP packets (only the TCP RST packets!) are not masqueraded. These connections are not closed properly, which leads to many open, timed-out connections.
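For anyone trying to reproduce, a capture filtered to RST packets that still carry a container source address makes the leak easy to spot. This is only a sketch; the interface name eno16780032 is the one from this host, substitute your own uplink:

# show only outgoing TCP RSTs that were NOT rewritten by MASQUERADE (source still 172.17.0.0/16)
$ tcpdump -ni eno16780032 'src net 172.17.0.0/16 and not dst net 172.17.0.0/16 and tcp[tcpflags] & tcp-rst != 0'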

docker version:

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:25:01 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:25:01 UTC 2015
 OS/Arch:      linux/amd64

docker info:

Containers: 24
Images: 152
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker-253:1-655723-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem:
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 6.73 GB
 Data Space Total: 107.4 GB
 Data Space Available: 90.2 GB
 Metadata Space Used: 8.729 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.139 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.18.17-13.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 23.59 GiB

uname -a:

Linux pcp-cons-vm1.pcp.asudc.net 3.18.17-13.el7.x86_64 #1 SMP Wed Jul 22 14:20:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Environment details (AWS, VirtualBox, physical, etc.): VMWare VM

How reproducible:

Steps to Reproduce:

  1. Initial setup:

Containers are built with docker-maven-plugin. Base image is centos:7. runcmd:

rm /etc/localtime
ln -s /usr/share/zoneinfo/Europe/Berlin /etc/localtime
curl -jksSLH "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u66-b17/jdk-8u66-linux-x64.rpm > jdk-8u66-linux-x64.rpm
yum -y install jdk-8u66-linux-x64.rpm
rm -f jdk-8u66-linux-x64.rpm
  2. Containers are started with Ansible:
- docker:
    image: "{{ image_name }}:{{ image_version }}"
    name: "{{ application_name }}"
    expose: "{{ application_port }}"  # workaround for https://github.com/ansible/ansible-modules-core/issues/147
    ports: "{{ application_port }}:{{ application_port }},{{ application_jmx_port }}:{{ application_jmx_port }}"
    env:
      application.stage: "{{ stage }}"
      application.host: "{{ inventory_hostname }}"
    state: "reloaded"
    restart_policy: "always"
    username: "{{ username }}"
    password: "{{ password }}"
    email: "{{ email }}"
  3. tcpdump the network traffic:
$ tcpdump -vvv '(src net 172.17.0.0/16 and not dst net 172.17.0.0/16)' -i eno16780032

Actual Results:

tcpdump: listening on eno16780032, link-type EN10MB (Ethernet), capture size 65535 bytes
09:40:20.765078 IP (tos 0x0, ttl 63, id 63793, offset 0, flags [DF], proto TCP (6), length 40)
    172.17.0.94.44220 > XXXXXXX.https: Flags [R], cksum 0x733e (correct), seq 3364944719, win 0, length 0
09:40:23.114700 IP (tos 0x0, ttl 63, id 64957, offset 0, flags [DF], proto TCP (6), length 40)
    172.17.0.94.44386 > XXXXXXX.https: Flags [R], cksum 0x04f6 (correct), seq 1452858090, win 0, length 0
09:40:30.599823 IP (tos 0x0, ttl 63, id 27499, offset 0, flags [DF], proto TCP (6), length 40)
    172.17.0.94.41293 > XXXXXXX.https: Flags [R], cksum 0x70d6 (correct), seq 2154385743, win 0, length 0
09:40:43.076053 IP (tos 0x0, ttl 63, id 7793, offset 0, flags [DF], proto TCP (6), length 40)
    172.17.0.94.45254 > XXXXXXX.https: Flags [R], cksum 0xae49 (correct), seq 3912735635, win 0, length 0
09:40:47.884495 IP (tos 0x0, ttl 63, id 11156, offset 0, flags [DF], proto TCP (6), length 40)
    172.17.0.94.45514 > XXXXXXX.https: Flags [R], cksum 0x3db8 (correct), seq 509662712, win 0, length 0

Expected Results:

The tcpdump output is empty (every outgoing TCP packet is masqueraded).

Additional info:

iptables -t nat -L:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  anywhere             anywhere             ADDRTYPE match src-type LOCAL
MASQUERADE  all  --  172.17.0.0/16        anywhere
MASQUERADE  tcp  --  172.17.0.14          172.17.0.14          tcp dpt:XXXX2
MASQUERADE  tcp  --  172.17.0.14          172.17.0.14          tcp dpt:XXXX1

Chain DOCKER (2 references)
target     prot opt source               destination
DNAT       tcp  --  anywhere             anywhere             tcp dpt:19392 to:172.17.0.14:XXXX2
DNAT       tcp  --  anywhere             anywhere             tcp dpt:19391 to:172.17.0.14:XXXX1

iptables -L:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             172.17.0.14          tcp dpt:XXXX2
ACCEPT     tcp  --  anywhere             172.17.0.14          tcp dpt:XXXX1

goekhanm avatar Dec 14 '15 12:12 goekhanm

USER POLL

The best way to get notified of updates is to use the Subscribe button on this page.

Please don't use "+1" or "I have this too" comments on issues. We automatically collect those comments to keep the thread short.

The people listed below have upvoted this issue by leaving a +1 comment:

@subsend

GordonTheTurtle avatar Feb 25 '16 08:02 GordonTheTurtle

/cc @mavenugo

thaJeztah avatar Feb 25 '16 10:02 thaJeztah

ping @mavenugo Could you take a look at this, please?

unclejack avatar Mar 07 '16 10:03 unclejack

@goekhanm I don't have a specific idea about this issue, but I'll ask a few questions that might help.

  1. Could you please try 1.10.3 and confirm the behavior?
  2. Do you have the userland-proxy disabled?
  3. Do you have firewalld enabled?
  4. Can you please capture the output of iptables -nvL, which will also give us the packet counters? (A sketch of these checks follows below.)
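For reference, a minimal sketch of how those checks could be run on the affected host (the daemon.json path is an assumption; on older releases the userland-proxy setting is passed on the daemon command line instead):

# 1. userland-proxy: check the daemon config and/or the daemon command line
$ cat /etc/docker/daemon.json 2>/dev/null
$ ps -ef | grep -i userland-proxy

# 2. firewalld
$ systemctl is-active firewalld

# 3. packet counters in the filter and nat tables
$ iptables -nvL
$ iptables -t nat -nvL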

BTW, I tried this scenario with the latest Docker Engine on Ubuntu 14.04 and it seems to work just fine.

mavenugo avatar Mar 13 '16 16:03 mavenugo

Is this still an issue?

AkihiroSuda avatar Jan 11 '17 15:01 AkihiroSuda

The issue is still reproducible and we're facing it.

upietz avatar Mar 10 '17 08:03 upietz

@upietz could you provide more info about your setup (output of docker version, and docker info), and additional info from @mavenugo's comment above https://github.com/docker/docker/issues/18630#issuecomment-195989351

thaJeztah avatar Mar 10 '17 08:03 thaJeztah

docker version:

Client:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 17:43:41 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 17:43:41 2016
 OS/Arch:      linux/amd64

docker info:

Containers: 100
 Running: 25
 Paused: 0
 Stopped: 75
Images: 22
Server Version: 1.12.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 544
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 47.26 GiB
Name: mesos-slave5
ID: HGK2:BUX2:CUKS:LLW5:FLLR:TCEU:BYP4:5J66:V6Q4:MT2O:QFM5:XS44
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Insecure Registries:
 127.0.0.0/8

iptables -nvL

Chain INPUT (policy ACCEPT 647K packets, 75M bytes)
 pkts bytes target            prot opt in        out      source       destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target            prot opt in        out      source       destination
9119M   14T DOCKER-ISOLATION  all  --  *         *        0.0.0.0/0    0.0.0.0/0
  29G   33T DOCKER            all  --  *         docker0  0.0.0.0/0    0.0.0.0/0
  23G   21T ACCEPT            all  --  *         docker0  0.0.0.0/0    0.0.0.0/0    ctstate RELATED,ESTABLISHED
  28G   48T ACCEPT            all  --  docker0   !docker0 0.0.0.0/0    0.0.0.0/0
    0     0 ACCEPT            all  --  docker0   docker0  0.0.0.0/0    0.0.0.0/0

Chain OUTPUT (policy ACCEPT 1532K packets, 694M bytes)
 pkts bytes target            prot opt in        out      source       destination

Chain DOCKER (1 references)
 pkts bytes target  prot opt in        out      source       destination
 374K   37M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.16  tcp dpt:5051
 562K   46M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.16  tcp dpt:5050
 562K   48M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.20  tcp dpt:4567
 374K   37M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.14  tcp dpt:5051
  62M   14G ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.14  tcp dpt:5050
 936K   68M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.23  tcp dpt:4000
 404M 1761G ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.9   tcp dpt:3000
  13M 1434M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.11  tcp dpt:8080
  12M 1351M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.18  tcp dpt:8080
  12M 1308M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.19  tcp dpt:8080
1819K  151M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.4   tcp dpt:8080
 324K   33M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.24  tcp dpt:8080
 318K   32M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.32  tcp dpt:5051
  55M   13G ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.32  tcp dpt:5050
1666K  186M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.33  tcp dpt:8080
 362K   31M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.13  tcp dpt:4567
 250M 1096G ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.10  tcp dpt:3000
8503K  926M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.26  tcp dpt:8080
    0     0 ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.3   tcp dpt:9010
3538K 1131M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.3   tcp dpt:4567
    0     0 ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.6   tcp dpt:9010
3544K 1134M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.6   tcp dpt:4567
    0     0 ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.7   tcp dpt:9010
3560K 1141M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.7   tcp dpt:4567
5439K   27G ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.15  tcp dpt:8080
1259K  597M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.17  tcp dpt:8080
1850K 9168M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.5   tcp dpt:8080
 590K 2804M ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.8   tcp dpt:8080
 9625 1827K ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0   172.17.0.12  tcp dpt:4567

Chain DOCKER-ISOLATION (1 references)
 pkts bytes target  prot opt in        out      source       destination
9119M   14T RETURN  all  --  *         *        0.0.0.0/0    0.0.0.0/0

userland-proxy is enabled

upietz avatar Mar 10 '17 08:03 upietz

Kernel Version: 3.16.0-4-amd64 probably has something to do with this issue. That kernel is most likely not receiving the relevant fixes.

Can you reproduce this with another distribution, another kernel, or both?

unclejack avatar Mar 10 '17 10:03 unclejack

I am having the same problem -- pretty confused! Pretty simple to repro though: https://gist.github.com/moribellamy/43649b23836786a65bc583c3210a8be5

I've set up a simple docker server that listens for a TCP connection and sends an RST packet immediately to the first client, then exits.

If you initiate the connection inside the container (telnet will do) and run tcpdump, you get this (expected output; notice the R packets):

root@4fcb979b150c:/# tcpdump -i lo -e 'port 12345'   # inside the docker container
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
00:00:46.990936 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype IPv4 (0x0800), length 74: localhost.57382 > localhost.12345: Flags [S], seq 3411329055, win 43690, options [mss 65495,sackOK,TS val 1613614 ecr 0,nop,wscale 7], length 0
00:00:46.990960 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype IPv4 (0x0800), length 74: localhost.12345 > localhost.57382: Flags [S.], seq 1273088479, ack 3411329056, win 43690, options [mss 65495,sackOK,TS val 1613614 ecr 1613614,nop,wscale 7], length 0
00:00:46.990980 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype IPv4 (0x0800), length 66: localhost.57382 > localhost.12345: Flags [.], ack 1, win 342, options [nop,nop,TS val 1613614 ecr 1613614], length 0
00:00:46.991323 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype IPv4 (0x0800), length 66: localhost.12345 > localhost.57382: Flags [R.], seq 1, ack 1, win 342, options [nop,nop,TS val 1613614 ecr 1613614], length 0

If you initiate the connection from outside the container (e.g. docker run -p 12345:12345 local:rst), you get no such reset packets

tcpdump -i lo0 -e 'port 12345'   # on the host machine
17:20:15.220923 AF IPv6 (30), length 88: localhost.56449 > localhost.italk: Flags [S], seq 212628271, win 65535, options [mss 16324,nop,wscale 5,nop,nop,TS val 197372606 ecr 0,sackOK,eol], length 0
17:20:15.221009 AF IPv6 (30), length 88: localhost.italk > localhost.56449: Flags [S.], seq 3769277609, ack 212628272, win 65535, options [mss 16324,nop,wscale 5,nop,nop,TS val 197372606 ecr 197372606,sackOK,eol], length 0
17:20:15.221020 AF IPv6 (30), length 76: localhost.56449 > localhost.italk: Flags [.], ack 1, win 12743, options [nop,nop,TS val 197372606 ecr 197372606], length 0
17:20:15.221028 AF IPv6 (30), length 76: localhost.italk > localhost.56449: Flags [.], ack 1, win 12743, options [nop,nop,TS val 197372606 ecr 197372606], length 0
17:20:15.222191 AF IPv6 (30), length 76: localhost.italk > localhost.56449: Flags [F.], seq 1, ack 1, win 12743, options [nop,nop,TS val 197372607 ecr 197372606], length 0
17:20:15.222213 AF IPv6 (30), length 76: localhost.56449 > localhost.italk: Flags [.], ack 2, win 12743, options [nop,nop,TS val 197372607 ecr 197372607], length 0
17:20:15.222276 AF IPv6 (30), length 76: localhost.56449 > localhost.italk: Flags [F.], seq 1, ack 2, win 12743, options [nop,nop,TS val 197372607 ecr 197372607], length 0
17:20:15.222310 AF IPv6 (30), length 76: localhost.italk > localhost.56449: Flags [.], ack 2, win 12743, options [nop,nop,TS val 197372607 ecr 197372607], length 0

In this particular case, telnet does the right thing both times (it shuts down). My guess is that telnet responds to the server's FIN with its own FIN, but I'm not a TCP guru. In general, though, some applications need that RST packet (for the reasons cited in the initial report).
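One rough way to narrow this down is to capture on both sides of the bridge and compare. A sketch only, assuming the default docker0 bridge and eth0 as the host uplink (both names are assumptions):

# RST as emitted by the container, before NAT
$ tcpdump -ni docker0 'tcp[tcpflags] & tcp-rst != 0 and port 12345'
# RST as it leaves the host; with working MASQUERADE the source should be the host address
$ tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0 and port 12345'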

EDIT: here is my docker info, running on macOS, addressing @unclejack's concern that it may be the kernel version.

[0] 05:21:45 ~/rooms/rst$ docker info
Containers: 28
 Running: 1
 Paused: 0
 Stopped: 27
Images: 127
Server Version: 17.06.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.31-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952GiB
Name: moby
ID: S6N2:K4Z7:6CUL:6S2X:HEV2:DBG2:LGE6:YH35:IGL5:BCZI:QKMJ:MCX2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 29
 Goroutines: 51
 System Time: 2017-07-11T00:31:15.184817286Z
 EventsListeners: 1
No Proxy: *.local, 169.254/16
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

moribellamy avatar Jul 11 '17 00:07 moribellamy

@mavenugo I'll ask just because someone else did before :). Could you give a quick comment, or pull in someone who is willing to give a quick comment on my previous post? Based on my experiment above and other corroborating reports, Docker containers only seem to support orderly TCP closes (FIN), not abortive TCP resets (RST).

It could be a subtle issue with our setups, but I bet my experiment will reproduce for most people. Whether or not I interpreted the experiment correctly, I don't know.

moribellamy avatar Jul 17 '17 21:07 moribellamy

It's been a few years so at this point I'm just too curious. Does anyone know more about the scope of this bug? Is it limited to only the scenario where OSX is the host OS, or something?

I just find it odd that docker is so ubiquitous in the world of production software, but Layer 4 networking doesn't seem to work fully. Do people mostly use docker for higher level protocols which paper over this issue?

EDIT: Regarding earlier "old kernel" concerns, my repro instructions in https://github.com/moby/moby/issues/18630#issuecomment-314288184 still work for ubuntu 18.04.

moribellamy avatar Feb 12 '20 18:02 moribellamy

@moribellamy I think conntrack might be marking the packets as INVALID and not performing the NAT. Can you please investigate further by running iptables commands using the LOG module, e.g. sudo iptables -A INPUT -j LOG -m state --state INVALID, or by viewing stats in the conntrack CLI?
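A sketch of how that could be checked, expanding on the suggestion above (the FORWARD chain is used here because container traffic is forwarded rather than delivered locally; the log prefix is arbitrary):

# log forwarded packets that conntrack classifies as INVALID
$ sudo iptables -I FORWARD -m conntrack --ctstate INVALID -j LOG --log-prefix "CT INVALID: "
$ dmesg -w | grep 'CT INVALID'
# per-CPU conntrack statistics (look at the "invalid" and "insert_failed" counters)
$ sudo conntrack -S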

arkodg avatar Feb 14 '20 23:02 arkodg

Any new information or updates regarding this possible bug?

mblazic avatar Apr 23 '21 22:04 mblazic

My theory is that this issue only manifests when macOS is the host OS. Docker on macOS is totally different from most production stacks, since it runs Linux under the hyperkit hypervisor (https://github.com/moby/hyperkit), in contrast to the native containerization setup you would see on a Linux distro.

I just strongly suspect that if this ever happened in prod, someone would have fixed it.

Anecdotal data: I no longer have an issue with it because it was only affecting my development workstation. As soon as I found out the issue wasn't happening in prod (for me), I just tolerated this issue instead of taking the time to debug it.

moribellamy avatar Apr 29 '22 17:04 moribellamy

Today we noticed this happening on several production hosts: TCP reset packets leaking due to INVALID conntrack state. A few of the hosts experiencing the issue on a regular basis were hitting the conntrack state table's maximum size.

We found some other hosts, which we are still investigating, that are not exceeding the conntrack state table's maximum size. If I understand the problem correctly, the NAT engine can't perform NAT on INVALID packets, so it just passes them on to the host interface unmodified.

Reducing the conntrack state table size or bombarding a host until the state table is full might be one way to generate INVALID ctstate packets that escape onto the host interface. Just a hypothesis at the moment.

I might try and replicate myself if I get some time to see if that might help narrow the scope on this issue.

"kernel: nf_conntrack: table full, dropping packet" << Should appear in the systemd journal once the conntrack table is full if anyone wants to try this as a potential way to replicate.

OS: Linux; Distribution: CentOS 7

Several different Docker versions and Linux kernel versions are experiencing the same problem.

UUIDNIE avatar Dec 06 '22 22:12 UUIDNIE

This issue has been open for 7 years already. Any idea why it can't be addressed?

testn avatar Feb 13 '23 02:02 testn