tc nat can do the same job

Open chenhaiq opened this issue 4 years ago • 5 comments

The DADDR iptables plugin rule, iptables -t mangle -A INPUT -m dscp --dscp 1 -j DADDR --set-daddr=192.168.0.2, can be replaced by tc, so no plugin is needed to use l3dsr:

tc qdisc add dev eth0 root handle 1: htb
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0x00040000 0x00ff0000 at 0 action nat ingress 192.168.0.3 192.168.0.2

where u32 0x00040000 0x00ff0000 at 0 matches TOS 0x04, which is DSCP 1. 192.168.0.3 is the real server IP, and 192.168.0.2 is the VIP.
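To make the match value less of a magic number, here is a small sketch of how it can be derived. The helper names dscp_to_u32_match and print_l3dsr_filter are hypothetical and only for illustration. It relies on DSCP being the upper 6 bits of the TOS byte (TOS = DSCP << 2) and on the TOS byte being the second byte of the first 32-bit word of the IPv4 header. The script only prints the filter command, it does not run tc. Note that the 0x00ff0000 mask used above compares the whole TOS byte, so it also requires the two ECN bits to be zero.

```shell
#!/bin/sh
# Hypothetical helper: compute the u32 match value (at offset 0) for a DSCP codepoint.
dscp_to_u32_match() {
    dscp=$1
    # DSCP is the upper 6 bits of the TOS byte, so TOS = DSCP << 2.
    tos=$(( dscp << 2 ))
    # The TOS byte is the second byte of the first 32-bit IPv4 header word,
    # so shift it left by 16 bits to line it up with "at 0".
    printf '0x%08x\n' $(( tos << 16 ))
}

# Hypothetical helper: print (not execute) the ingress filter from this post
# for a given device, DSCP value, real server IP and VIP.
print_l3dsr_filter() {
    dev=$1 dscp=$2 real_ip=$3 vip=$4
    match=$(dscp_to_u32_match "$dscp")
    echo "tc filter add dev $dev parent ffff: protocol ip prio 1 u32 match u32 $match 0x00ff0000 at 0 action nat ingress $real_ip $vip"
}

# Reproduces the filter line above for DSCP 1.
print_l3dsr_filter eth0 1 192.168.0.3 192.168.0.2
```

With DSCP 1 this prints the same filter line as above. For another DSCP value only the match word changes, for example DSCP 46 gives 0x00b80000.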

chenhaiq avatar Sep 25 '19 09:09 chenhaiq

Thank you for the report!

Back when L3DSR was being implemented, we investigated the idea of using tc, but it was rejected. I no longer recall why, and in skimming my notes, I've been unable to find that reasoning. However, that was some 10-odd years ago, back in the RHEL 4 days for us. That reasoning should be (re-)discovered, checked for whether it still applies, and documented either way.

What test cases and use cases have you tried your approach with so far and with what kernels? Have you tried it in combination with other tc and iptables rules to see how it interacts?

qbarnes avatar Sep 25 '19 18:09 qbarnes

I tried on Ubuntu 18.04 with kernel 4.15. There are 3 combinations:

  1. iptables -t mangle -A INPUT -m dscp --dscp 1 -j DADDR --set-daddr=192.168.0.2 works;

  2. tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0x00040000 0x00ff0000 at 0 action nat ingress 192.168.0.3 192.168.0.2 works exactly the same as #1;

  3. iptables -t nat -A PREROUTING -m dscp --dscp 1 -j DNAT --to-destination 192.168.0.2 does not work. I think this is why you wrote an iptables plugin.

Do you know why iptables NAT does not work in this case? I can see from the iptables log that the destination address was changed, but the application still responds from the real server IP address.

chenhaiq avatar Sep 29 '19 07:09 chenhaiq

I feel like a software archeologist going back and digging into this old information!

The very first efforts to scope out the functionality required for implementing L3DSR were done by a different group that I wasn't part of. Before contacting me in May 2008, they had already concluded that the NAT approach was not the route to go because of concerns over it being too CPU and/or memory intensive for our Yahoo! production workloads, and it being a possible DoS (denial of service) vector. That's why they brought me in to do the iptables module work, because of my kernel experience.

As for why your item 3 doesn't work, I remember toying with the idea a decade ago, but NAT was a monstrosity with a lot of temperamental quirks when trying to push it to do something the designers didn't intend. I didn't have that good a grasp of all it does to the networking stack to make it work, but I have vague recollections of it being "too smart" and helpful, holding on to and monitoring too much networking state information to be repurposed. Did you try testing your item 3 with just TCP traffic, or did you try with UDP or ICMP? If NAT works with the latter, then you know the problem is NAT not seeing the reverse TCP traffic, which went straight to the client, and concluding it got lost.

For item 2, have you done any latency, throughput, or load testing comparing 1 with 2 yet?

qbarnes avatar Sep 29 '19 18:09 qbarnes

I have tested item 3 with ICMP. iptables NAT does not work either.

PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=0.575 ms

I have not tested performance yet. I actually learned the idea of l3dsr from the implementation in fd.io VPP. It is a very good idea to use DSCP instead of an overlay tunnel.

chenhaiq avatar Oct 12 '19 14:10 chenhaiq

I tried in ubuntu 1804+ kernel 4.15. There are 3 combinations:

  1. iptables -t mangle -A INPUT -m dscp --dscp 1 -j DADDR --set-daddr=192.168.0.2 works;
  2. tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match u32 0x00040000 0x00ff0000 at 0 action nat ingress 192.168.0.3 192.168.0.2 works exactly the same with #1 ;
  3. iptables -t nat -A PREROUTING -m dscp --dscp 1 -j DNAT --to-destination 192.168.0.2 does not work. I think this is why you wrote an iptables plugin.

Do you know why iptables nat does not work in this case? I can see that the destination address was changed from iptables log, but the application still responds real server ip address.

Can you please let me know how the second step works? A response would be appreciated.

svootukuru21 avatar Feb 07 '24 09:02 svootukuru21