kube-router icon indicating copy to clipboard operation
kube-router copied to clipboard

fix(nsc): TCPMSS rules are created per tcp service

Open rkojedzinszky opened this issue 1 year ago • 3 comments

This allows VIPs to be shared between services.

Fixes #1671

rkojedzinszky avatar May 16 '24 06:05 rkojedzinszky

With this change, a maxseg rule exists for each service:

# nft list chain ip mangle PREROUTING
# Warning: table ip mangle is managed by iptables-nft, do not touch!
table ip mangle {
        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
                ip daddr 192.168.9.1 tcp dport 1883 counter packets 0 bytes 0 meta mark set 0x2e69
                ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 1883 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.0 udp dport 53 counter packets 6367 bytes 416739 meta mark set 0x1e8c
                ip daddr 192.168.9.0 tcp dport 53 counter packets 0 bytes 0 meta mark set 0x234
                ip daddr 192.168.9.0 tcp flags syn / syn,rst tcp dport 53 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.2 tcp dport 1222 counter packets 0 bytes 0 meta mark set 0xa15
                ip daddr 192.168.9.2 tcp flags syn / syn,rst tcp dport 1222 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.8.192 tcp dport 443 counter packets 0 bytes 0 meta mark set 0x2f16
                ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 443 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.8.192 tcp dport 80 counter packets 0 bytes 0 meta mark set 0x17f1
                ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 80 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.88 tcp dport 8082 counter packets 0 bytes 0 meta mark set 0x1598
                ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8082 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.88 tcp dport 8080 counter packets 0 bytes 0 meta mark set 0x18be
                ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8080 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.1 tcp dport 3100 counter packets 0 bytes 0 meta mark set 0x3e6d
                ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 3100 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.88 tcp dport 8083 counter packets 0 bytes 0 meta mark set 0x172b
                ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8083 counter packets 0 bytes 0 tcp option maxseg size set 1440
        }
}

After making service on port 8083 not-ready, the two rules simply get removed:

# nft list chain ip mangle PREROUTING
# Warning: table ip mangle is managed by iptables-nft, do not touch!
table ip mangle {
        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
                ip daddr 192.168.9.1 tcp dport 1883 counter packets 0 bytes 0 meta mark set 0x2e69
                ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 1883 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.0 udp dport 53 counter packets 6388 bytes 418115 meta mark set 0x1e8c
                ip daddr 192.168.9.0 tcp dport 53 counter packets 0 bytes 0 meta mark set 0x234
                ip daddr 192.168.9.0 tcp flags syn / syn,rst tcp dport 53 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.2 tcp dport 1222 counter packets 0 bytes 0 meta mark set 0xa15
                ip daddr 192.168.9.2 tcp flags syn / syn,rst tcp dport 1222 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.8.192 tcp dport 443 counter packets 0 bytes 0 meta mark set 0x2f16
                ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 443 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.8.192 tcp dport 80 counter packets 0 bytes 0 meta mark set 0x17f1
                ip daddr 192.168.8.192 tcp flags syn / syn,rst tcp dport 80 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.88 tcp dport 8082 counter packets 0 bytes 0 meta mark set 0x1598
                ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8082 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.88 tcp dport 8080 counter packets 0 bytes 0 meta mark set 0x18be
                ip daddr 192.168.9.88 tcp flags syn / syn,rst tcp dport 8080 counter packets 0 bytes 0 tcp option maxseg size set 1440
                ip daddr 192.168.9.1 tcp dport 3100 counter packets 0 bytes 0 meta mark set 0x3e6d
                ip daddr 192.168.9.1 tcp flags syn / syn,rst tcp dport 3100 counter packets 0 bytes 0 tcp option maxseg size set 1440
        }
}

rkojedzinszky avatar May 16 '24 13:05 rkojedzinszky

Thanks for raising a PR. This indeed looks pretty minimal, and I don't think it hurts to make the TCPMSS configurations port specific. I just left one clarifying question as a review.

aauren avatar May 18 '24 21:05 aauren

So it turns out that we dont need POSTROUTING rules with destination address specified here, only rules with source address specification, as that is the SYN+ACK packet that will reach the connection initiator. That packet has to have a reduced MSS value. Also, as that packet comes from a PODs own ns, it is placed in PREROUTING instead of POSTROUTING. A more nice solution would be to have this setup in the PODs own network namespace rather than in each node's hostnetwork.

rkojedzinszky avatar May 19 '24 15:05 rkojedzinszky

@aauren did you have time to check/verify the PR?

rkojedzinszky avatar May 23 '24 17:05 rkojedzinszky

@rkojedzinszky - Definitely appreciate you submitting an PR for this and all of the detail added to the PR and issue!

I'll try to find some time today to take a look. However, if I can't find a chance to fully review it, it could be a week or two until I get to it because I've got a busy personal schedule the next 2-ish weeks.

aauren avatar May 24 '24 17:05 aauren

@aauren thanks, I will be patient.

rkojedzinszky avatar May 24 '24 18:05 rkojedzinszky

@rkojedzinszky - I had enough time today to look into this.

I think that it looks good! I agree with your conclusion that it is only necessary to set this via source and PREROUTING instead of POSTROUTING. Additionally, I got a chance to test this on a cluster of my own, and it looks correct from what I can see in tcpdump.

I did make one small change to your branch where I pushed a commit that adds some cleanup code into the logic that will hopefully remove lingering MSS changes from previous versions of kube-router (e.g. before this change). It will unfortunately, only execute when the DSR service is updated or deleted / re-applied, but getting it to run as a one-off is too much logic I think for now. Re-creating or forcing an update to DSR services is at least better than having to reboot the node to remove old iptables rules IMO.

Can you test one more time in your cluster to make sure that it is working with the extra iptables rules I added? If so, let me know and we'll merge it.

aauren avatar May 24 '24 23:05 aauren

@aauren shouldn't that code be moved out of the if protocol == tcpProtocol block?

rkojedzinszky avatar May 25 '24 15:05 rkojedzinszky

@aauren I am running kube-router without your patch, but, I've rebooted all my nodes, so I have no rules left there from previous kube-router. Howewer, I like the idea to support the upgrade without needing a reboot. But, as in my previous comment, I suspect that cleanup shall be placed outside of the if block, where it originally was.

rkojedzinszky avatar May 25 '24 15:05 rkojedzinszky

@aauren shouldn't that code be moved out of the if protocol == tcpProtocol block?

That's a good catch. I moved it out of the protocol == tcpProtocol block

aauren avatar May 25 '24 17:05 aauren

@rkojedzinszky please take one more look to make sure that I didn't miss something and if you're good we'll merge.

aauren avatar May 25 '24 18:05 aauren

@aauren I think we also could do the cleanup of old rules during setting up DSR. Then there will be no lingering rules. What do you think?

rkojedzinszky avatar May 26 '24 20:05 rkojedzinszky

This seems fine to me.

Thanks for all of your work on this @rkojedzinszky!

aauren avatar May 27 '24 16:05 aauren

I didn't notice that you had opened your branch against v2.1. Not a huge deal this time, since the changes would have had to have been merged there eventually to cut a new bugfix release. However, in the future, please open all changes against the master tree and I'll take care of the release branches.

For now I have also cherry-picked these changes into master.

aauren avatar May 27 '24 16:05 aauren

@aauren sorry for that, next time I will do that way. Thanks for merging.

rkojedzinszky avatar May 27 '24 16:05 rkojedzinszky

No worries. I appreciate the help and review. :smile:

aauren avatar May 27 '24 17:05 aauren