behavioral-model

TCP connection reset by server using docker containers

Open mcvzon51 opened this issue 4 years ago • 10 comments

I'm trying to connect Docker containers using the P4 bmv2 switch model. Two containers on the same host (connected to the same switch) can establish a TCP connection, but two containers on different hosts can't. After the server receives the SYN packet, it replies with a SYN, ACK packet that has the reset bit set. Before a packet is sent to another host, NAT is applied. The action responsible for this looks like:

action nat(port, mac_src, mac_dst, host_ip, dst_ip) {
    modify_field(ether_hdr.src, mac_src);
    modify_field(ether_hdr.dst, mac_dst);
    modify_field(ipv4_hdr.src, host_ip);
    modify_field(ipv4_hdr.dst, dst_ip);
    modify_field(standard_metadata.egress_spec, port);
}

Then in the other host the switch translates the public addresses back to private addresses:

action set_mac(mac_src, mac_dst, src, dst, port) {
    modify_field(ether_hdr.src, mac_src);
    modify_field(ether_hdr.dst, mac_dst);
    modify_field(ipv4_hdr.src, src);
    modify_field(ipv4_hdr.dst, dst);
    modify_field(standard_metadata.egress_spec, port);
}

I don't think translating the addresses is the problem since the two containers can ping each other. I think it has something to do with an invalid checksum somehow. The TCP checksum is computed like this:

field_list tcp_checksum_list {
        ipv4_hdr.src;
        ipv4_hdr.dst;
        8'0;
        ipv4_hdr.proto;
        meta.tcpLength;
        tcp_hdr.src;
        tcp_hdr.dst;
        tcp_hdr.seq;
        tcp_hdr.ack;
        tcp_hdr.offset;
        tcp_hdr.resrv;
        tcp_hdr.flags;
        tcp_hdr.window;
        tcp_hdr.urgent;
        payload;
}

field_list_calculation tcp_checksum {
    input {
        tcp_checksum_list;
    }
    algorithm : csum16;
    output_width : 16;
}

calculated_field tcp_hdr.checksum {
    update tcp_checksum if(valid(tcp_hdr));
}

One question: do we need to include the TCP options in the checksum, and does the switch change the options? When I run an iperf3 session between local containers (so no NAT is applied), I get the following log: switch3_log_local_checksum_not_correct. I was a bit surprised to see that the checksum is updated, and even more surprised that the checksum is not correct. I'm really wondering what's going wrong.
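On the question of TCP options: the TCP checksum is defined over the pseudo-header plus the entire TCP segment, so any option bytes are covered along with the fixed header and the payload. The Python sketch below illustrates this outside of P4. The SYN segment it builds is hypothetical (the option bytes, window size, and sequence number are made up for illustration; only the ports and container addresses come from the captures in this thread), and the function names are not part of bmv2.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """IPv4 pseudo-header: src, dst, zero byte, protocol 6, TCP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 6, tcp_length))


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over pseudo-header + segment (checksum field must be 0)."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF


def segment_is_valid(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """A correct segment sums to 0xFFFF when the checksum is included."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF


def build_syn(checksum: int = 0) -> bytes:
    """Hypothetical 40-byte SYN: 20-byte header + 20 bytes of options."""
    header = struct.pack("!HHIIBBHHH",
                         49604, 5201,   # ports from the captures
                         0, 0,          # seq / ack (illustrative)
                         10 << 4,       # data offset: 10 words = 40 bytes
                         0x02,          # SYN flag
                         64240,         # window (illustrative)
                         checksum, 0)   # checksum, urgent pointer
    # Illustrative options: MSS, SACK permitted, timestamps, NOP, wscale
    options = bytes.fromhex("020405b4" "0402" "080a"
                            "0000000100000000" "01" "030307")
    return header + options


csum = tcp_checksum("172.17.0.5", "172.17.0.2", build_syn())
segment = build_syn(csum)
print(hex(csum), segment_is_valid("172.17.0.5", "172.17.0.2", segment))
```

Running `segment_is_valid` on full packet bytes taken from a capture is one way to decide whether a given checksum is actually wrong, rather than merely different from the value seen at another capture point.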

mcvzon51 avatar Jun 10 '20 13:06 mcvzon51

What is the output of simple_switch --version, assuming you are using the simple_switch process? Or if you are using a different command, which one is it, and what version?

Given the evidence that you have that perhaps the TCP checksum is incorrect, have you tried capturing the full packet contents at one or more links in your system, and examined them in Wireshark to see if Wireshark also shows them as having an incorrect TCP header checksum? If so, isolating the last point where the checksum is good in the packet's path, and the first point where it is bad, would help isolate the point where the problem is introduced.

jafingerhut avatar Jun 10 '20 13:06 jafingerhut

The version is: 1.13.0-5bb6075d. I've compared the checksum of the first SYN packet (syn = 1) as it leaves the client container and as it arrives in the server container. The two values are not the same, even though all the fields of this packet changed by the NAT process in the switch are the same and in the same order. I have also checked the TCP checksum before and after a packet traverses the switch in the scenario with local containers. Here as well the TCP checksum changes inside the switch, even though none of the fields in the packet change (the switch does no NAT; it only sets the egress port). But between two local containers the three-way handshake succeeds, as do the iperf3 tests.

mcvzon51 avatar Jun 10 '20 14:06 mcvzon51

I think sharing a Wireshark packet trace, along with enough information to identify the packets (e.g. hosts' IP addresses), would help you troubleshoot this.

antoninbas avatar Jun 10 '20 22:06 antoninbas

I work in VMs on remote hosts, so I added the data from tshark below the topology. The topology used is: top_1

Client VM
fields:
-e eth 
-e ip
-e tcp -e tcp.flags.syn -e tcp.flags -e tcp.checksum

client container eth0: 	
Ethernet II, Src: 02:42:ac:11:00:05, Dst: 02:42:ac:11:00:02
 Internet Protocol Version 4, Src: 172.17.0.5, Dst: 172.17.0.2
Transmission Control Protocol, Src Port: 49604, Dst Port: 5201, Seq: 0, Len: 0	
1	0x00000002	0x00005858

veth before switch: 		
Ethernet II, Src: 02:42:ac:11:00:05, Dst: 02:42:ac:11:00:02
 Internet Protocol Version 4, Src: 172.17.0.5, Dst: 172.17.0.2
Transmission Control Protocol, Src Port: 49604, Dst Port: 5201, Seq: 0, Len: 0	
1	0x00000002	0x00005858

client VM eth0:      	
Ethernet II, Src: 3e:25:f7:61:ab:13, Dst: 1e:11:08:c9:9a:88
 Internet Protocol Version 4, Src: <public ip client VM>, Dst: <public ip server VM>
Transmission Control Protocol, Src Port: 49604, Dst Port: 5201, Seq: 0, Len: 0	
1	0x00000002	0x0000e195

-----------------------------------------------------------------------------------
Server VM

fields:
-e eth 
-e ip 
-e tcp -e tcp.flags.syn -e tcp.flags -e tcp.checksum

server VM eth0:   	
Ethernet II, Src: 3e:25:f7:61:ab:13, Dst: 1e:11:08:c9:9a:88
Internet Protocol Version 4, Src: <public ip client VM>, Dst: <public ip server VM>
Transmission Control Protocol, Src Port: 49604, Dst Port: 5201, Seq: 0, Len: 0	
1	0x00000002	0x0000e195

veth after switch:  
Ethernet II, Src: 02:42:ac:11:00:05, Dst: 02:42:ac:11:00:02
Internet Protocol Version 4, Src: 172.17.0.5, Dst: 172.17.0.2
Transmission Control Protocol, Src Port: 49604, Dst Port: 5201, Seq: 0, Len: 0	
1	0x00000002	0x0000b261


server container eth0: 
Ethernet II, Src: 02:42:ac:11:00:05, Dst: 02:42:ac:11:00:02
Internet Protocol Version 4, Src: 172.17.0.5, Dst: 172.17.0.2
Transmission Control Protocol, Src Port: 49604, Dst Port: 5201, Seq: 0, Len: 0	
1	0x00000002	0x0000b261

I noticed that if the SYN flag is 0 it's not printed. I do have the full packet info; if you think it will help, I will add it. By the way, is the above code for the TCP checksum update correct? Am I missing any fields, or have I added too many?

mcvzon51 avatar Jun 11 '20 09:06 mcvzon51

Your P4 program looks correct (although it's a weird NAT implementation). It's hard to find a reference P4_14 program these days :), but you can compare yours with this one: https://github.com/p4lang/tutorials/blob/e7e6899d5c5c90fd3033d9bc22e23a84e48ba81c/examples/simple_nat/p4src/simple_nat.p4

Can you indicate which of the packets above have an incorrect TCP checksum?

Also, you mention that the reset bit is set in the server's reply, but I don't think an invalid TCP checksum would cause that.

antoninbas avatar Jun 12 '20 00:06 antoninbas

Actually I used that as a bit of a guide :). I don't know exactly whether the checksum is considered invalid, but I consider the TCP checksum of the packet arriving at the server container invalid because all the fields are the same except the checksum. Since the server's reply packet has the reset bit set, I assumed that the TCP checksum is invalid. If it's not the TCP checksum, I don't know what else it could be. I checked the containers' ufw status, which is inactive; there are no firewall rules, and the policies of all the chains accept all traffic. And if I use the same container with another local container, everything works fine. Because of that, I assumed the firewall could be ruled out as the cause of the failed connection.

I'm quite new to this, so I thought maybe my NAT implementation is the problem. But when I ping the remote container, everything works fine as well. The NAT process does not change any fields of the TCP header; the only change at the TCP level is the update of its checksum. What I didn't expect (because at this point nothing has been changed in the packet) was that the switch considers the TCP checksum invalid on arrival. Is that correct? I was also surprised that the TCP checksum has to be updated even though no addresses were translated. Is this common? Anyway, I'm quite new and quite stuck, not a good combination :)

mcvzon51 avatar Jun 12 '20 11:06 mcvzon51

By the way, in case I want to implement https://github.com/p4lang/tutorials/blob/e7e6899d5c5c90fd3033d9bc22e23a84e48ba81c/examples/simple_nat/p4src/simple_nat.p4 for the topology and addresses shown above: which rules do I have to add via the CLI, assuming the container is connected to port 0 of the switch and eth0 (VM) to port 1? I don't completely understand the following rules:

table_add nat nat_miss_ext_to_int 1 1 1 0.0.0.0&&&0.0.0.0 0.0.0.0&&&0.0.0.0 0&&&0 0&&&0 => 99
table_add nat nat_miss_int_to_ext 0 1 1 0.0.0.0&&&0.0.0.0 0.0.0.0&&&0.0.0.0 0&&&0 0&&&0 => 99

Which addresses do 0.0.0.0&&&0.0.0.0 and 0&&&0 represent?

mcvzon51 avatar Jun 12 '20 11:06 mcvzon51

You can find fairly detailed documentation of the behavior and syntax of table_add on this page (search for 'table_add'), including the meaning of the &&& syntax: https://github.com/p4lang/behavioral-model/blob/master/docs/runtime_CLI.md

0&&&0 in a ternary field of a P4 table means "match all possible values that can be looked up for that field".
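
To make the value&&&mask semantics concrete, here is a small Python sketch of ternary matching: a lookup key matches an entry when the key and the entry's value agree on every bit that is set in the mask. The helper names ternary_match and ip are illustrative and not part of bmv2 or its CLI.

```python
import ipaddress


def ternary_match(key: int, value: int, mask: int) -> bool:
    """True when key and value agree on every bit set in mask."""
    return (key & mask) == (value & mask)


def ip(text: str) -> int:
    """Convert dotted-quad IPv4 text to an integer."""
    return int(ipaddress.IPv4Address(text))


# 0.0.0.0&&&0.0.0.0: the mask selects no bits, so every key matches.
print(ternary_match(ip("172.17.0.5"), ip("0.0.0.0"), ip("0.0.0.0")))        # True

# 172.17.0.0&&&255.255.0.0: only the top 16 bits are compared.
print(ternary_match(ip("172.17.0.5"), ip("172.17.0.0"), ip("255.255.0.0")))  # True
print(ternary_match(ip("10.0.0.1"), ip("172.17.0.0"), ip("255.255.0.0")))    # False
```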

Regarding the issue you believe you are having with checksums, it can be tedious to debug, I know, but the best evidence you can get for the location of the problem is to try to record the full contents of the packet near the source, and near the destination. If the one near the destination has a bad checksum, but the one near the source has a good checksum, then you know something between those two points introduced a problem.

It could be the simple_switch process, perhaps. But that requires strong evidence to show it was simple_switch, e.g. record the packet going into the simple_switch process, and the one coming out of it, and show it was good going in and bad coming out.

Other possible places where problems might be introduced: some Ethernet and virtual ethernet interfaces have options for offloading TCP checksum calculation "to the NIC", even if that NIC is not a physical card, but software implementing a virtual ethernet interface.
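
A minimal sketch of how one might inspect and disable checksum offload, assuming the standard Linux ethtool utility is installed where the interface lives. The helper name offload_commands and the interface name eth0 are illustrative. The helper only builds the command strings and does not run them, since changing offload settings normally requires root.

```python
import shlex
from typing import List


def offload_commands(iface: str) -> List[str]:
    """Return ethtool commands to show, then disable, checksum offload."""
    dev = shlex.quote(iface)
    return [
        f"ethtool -k {dev}",                # lowercase -k: list offload settings
        f"ethtool -K {dev} tx off rx off",  # uppercase -K: change settings
    ]


for command in offload_commands("eth0"):
    print(command)
```

With TX checksum offload disabled on the sending interface, a capture taken there should show the fully computed checksum instead of a value the kernel expected the NIC to finish.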

jafingerhut avatar Jun 13 '20 13:06 jafingerhut

> Other possible places where problems might be introduced: some Ethernet and virtual ethernet interfaces have options for offloading TCP checksum calculation "to the NIC", even if that NIC is not a physical card, but software implementing a virtual ethernet interface.

+1 to that. You say that packets arriving at the server have a different checksum than the packets leaving the client, despite the contents being the same. But unless you disable TCP checksum offloading on the containers' veth interfaces, it is expected that the checksum will be invalid when you capture the packet before the first switch. After that, bmv2 will compute the checksum and assuming the P4 program is correct, the checksum will be correct.
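
The captures earlier in the thread appear consistent with this. With TX checksum offload, Linux typically leaves only the ones' complement sum of the pseudo-header in the TCP checksum field and expects the NIC to finish the job. The sketch below computes that partial sum for the container addresses in the trace. The TCP length of 40 bytes is an assumption (a 20-byte header plus 20 bytes of options, which is common for a Linux SYN); the trace does not show the header length. The function name is illustrative.

```python
import socket
import struct


def pseudo_header_sum(src_ip: str, dst_ip: str, tcp_length: int,
                      proto: int = 6) -> int:
    """Folded 16-bit ones' complement sum of the IPv4 pseudo-header."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, proto, tcp_length))
    total = sum(word for (word,) in struct.iter_unpack("!H", pseudo))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


# Assumed TCP length: 40 bytes (20-byte header + 20 bytes of options).
print(hex(pseudo_header_sum("172.17.0.5", "172.17.0.2", 40)))  # 0x5858
```

Under that assumption the result equals the 0x5858 seen on the client container's eth0 and on the veth before the switch, which suggests those captures show a not-yet-completed checksum rather than a corrupted one, and that the 0xb261 written by the switch may well be the correct value.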

antoninbas avatar Jun 15 '20 17:06 antoninbas

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

github-actions[bot] avatar Sep 01 '22 00:09 github-actions[bot]