kube-router
kube-router copied to clipboard
fix(nsc): TCPMSS rules are created per tcp service
This allows VIPs to be shared between services.
Fixes #1671
With this change, a maxseg rule exists for each service:
# nft list chain ip mangle PREROUTING
# Warning: table ip mangle is managed by iptables-nft, do not touch!
table ip mangle {
chain PREROUTING {
type filter hook prerouting priority mangle; policy accept;
ip daddr 192.168.9.1 tcp dport 1883 counter packets 0 bytes 0 meta mark set 0x2e69
ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 1883 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.0 udp dport 53 counter packets 6367 bytes 416739 meta mark set 0x1e8c
ip daddr 192.168.9.0 tcp dport 53 counter packets 0 bytes 0 meta mark set 0x234
ip daddr 192.168.9.0 tcp flags syn / syn,rst tcp dport 53 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.2 tcp dport 1222 counter packets 0 bytes 0 meta mark set 0xa15
ip daddr 192.168.9.2 tcp flags syn / syn,rst tcp dport 1222 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.8.192 tcp dport 443 counter packets 0 bytes 0 meta mark set 0x2f16
ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 443 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.8.192 tcp dport 80 counter packets 0 bytes 0 meta mark set 0x17f1
ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 80 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.88 tcp dport 8082 counter packets 0 bytes 0 meta mark set 0x1598
ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8082 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.88 tcp dport 8080 counter packets 0 bytes 0 meta mark set 0x18be
ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8080 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.1 tcp dport 3100 counter packets 0 bytes 0 meta mark set 0x3e6d
ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 3100 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.88 tcp dport 8083 counter packets 0 bytes 0 meta mark set 0x172b
ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8083 counter packets 0 bytes 0 tcp option maxseg size set 1440
}
}
After making service on port 8083 not-ready, the two rules simply get removed:
# nft list chain ip mangle PREROUTING
# Warning: table ip mangle is managed by iptables-nft, do not touch!
table ip mangle {
chain PREROUTING {
type filter hook prerouting priority mangle; policy accept;
ip daddr 192.168.9.1 tcp dport 1883 counter packets 0 bytes 0 meta mark set 0x2e69
ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 1883 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.0 udp dport 53 counter packets 6388 bytes 418115 meta mark set 0x1e8c
ip daddr 192.168.9.0 tcp dport 53 counter packets 0 bytes 0 meta mark set 0x234
ip daddr 192.168.9.0 tcp flags syn / syn,rst tcp dport 53 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.2 tcp dport 1222 counter packets 0 bytes 0 meta mark set 0xa15
ip daddr 192.168.9.2 tcp flags syn / syn,rst tcp dport 1222 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.8.192 tcp dport 443 counter packets 0 bytes 0 meta mark set 0x2f16
ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 443 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.8.192 tcp dport 80 counter packets 0 bytes 0 meta mark set 0x17f1
ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 80 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.88 tcp dport 8082 counter packets 0 bytes 0 meta mark set 0x1598
ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8082 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.88 tcp dport 8080 counter packets 0 bytes 0 meta mark set 0x18be
ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8080 counter packets 0 bytes 0 tcp option maxseg size set 1440
ip daddr 192.168.9.1 tcp dport 3100 counter packets 0 bytes 0 meta mark set 0x3e6d
ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 3100 counter packets 0 bytes 0 tcp option maxseg size set 1440
}
}
Thanks for raising a PR. This indeed looks pretty minimal, and I don't think it hurts to make the TCPMSS configurations port specific. I just left one clarifying question as a review.
So it turns out that we dont need POSTROUTING rules with destination address specified here, only rules with source address specification, as that is the SYN+ACK packet that will reach the connection initiator. That packet has to have a reduced MSS value. Also, as that packet comes from a PODs own ns, it is placed in PREROUTING instead of POSTROUTING. A more nice solution would be to have this setup in the PODs own network namespace rather than in each node's hostnetwork.
@aauren did you have time to check/verify the PR?
@rkojedzinszky - Definitely appreciate you submitting an PR for this and all of the detail added to the PR and issue!
I'll try to find some time today to take a look. However, if I can't find a chance to fully review it, it could be a week or two until I get to it because I've got a busy personal schedule the next 2-ish weeks.
@aauren thanks, I will be patient.
@rkojedzinszky - I had enough time today to look into this.
I think that it looks good! I agree with your conclusion that it is only necessary to set this via source and PREROUTING instead of POSTROUTING. Additionally, I got a chance to test this on a cluster of my own, and it looks correct from what I can see in tcpdump.
I did make one small change to your branch where I pushed a commit that adds some cleanup code into the logic that will hopefully remove lingering MSS changes from previous versions of kube-router (e.g. before this change). It will unfortunately, only execute when the DSR service is updated or deleted / re-applied, but getting it to run as a one-off is too much logic I think for now. Re-creating or forcing an update to DSR services is at least better than having to reboot the node to remove old iptables rules IMO.
Can you test one more time in your cluster to make sure that it is working with the extra iptables rules I added? If so, let me know and we'll merge it.
@aauren shouldn't that code be moved out of the if protocol == tcpProtocol block?
@aauren I am running kube-router without your patch, but, I've rebooted all my nodes, so I have no rules left there from previous kube-router. Howewer, I like the idea to support the upgrade without needing a reboot. But, as in my previous comment, I suspect that cleanup shall be placed outside of the if block, where it originally was.
@aauren shouldn't that code be moved out of the if protocol == tcpProtocol block?
That's a good catch. I moved it out of the protocol == tcpProtocol block
@rkojedzinszky please take one more look to make sure that I didn't miss something and if you're good we'll merge.
@aauren I think we also could do the cleanup of old rules during setting up DSR. Then there will be no lingering rules. What do you think?
This seems fine to me.
Thanks for all of your work on this @rkojedzinszky!
I didn't notice that you had opened your branch against v2.1. Not a huge deal this time, since the changes would have had to have been merged there eventually to cut a new bugfix release. However, in the future, please open all changes against the master tree and I'll take care of the release branches.
For now I have also cherry-picked these changes into master.
@aauren sorry for that, next time I will do that way. Thanks for merging.
No worries. I appreciate the help and review. :smile: