kilo icon indicating copy to clipboard operation
kilo copied to clipboard

Lean iptables updates

Open dajudge opened this issue 1 year ago • 4 comments

Why

We were repeatedly expecting connectivity issues when scaling larger clusters up/down and believe it's related to the way kilo applies changes to the iptables when nodes are added/removed: in many cases that resulted in many rules (or even entire chains) being dropped and recreated during scaling.

Idea

After taking a closer look at the iptables rules created by kilo we figured we have identified two different classes of the same:

  1. some rules need to be applied strictly in order and always at the end of a chain (e.g. the FORWARD to KILO-IPIP and DROP rule in the INPUT chain - also e.g. the MASQUERADE rule at the end of the KILO-NAT chain)
  2. other rules don't have a strict ordering requirement (apart from having to be in front of the rules that need to be at the end of a chain where applicable)

How

So we started preparing this patch that puts the different rules in separate slices and

  1. first applies the order-sensitive/append-only rules using the existing mechanics
  2. then inserts the 2nd class of rules each one at the beginning of the respective chain if they're not already there - regardless of order
  3. removes any rules of the 2nd class that are not required anymore

chain rules are treated as order-sensitive/append-only which does not matter a lot beyond being applied in the first wave of changes

Results

That way we were able to accomplish the following:

  1. the first class of rules gets only applied once because they don't change after the first reconciliation (it's only the chains and the static rules at the end of some chains)
  2. for updates of the 2nd class of rules (they tend to be per ip/subnet) already existing rules never get dropped unless they need to be removed

We were able to confirm that this reduces the number of mutating iptables operations after the initial iptables configuration to the absolute minimum required.

dajudge avatar Jul 26 '22 12:07 dajudge

In general I think this is an important priority for Kilo. Another way I've considered solving this in the past is by using ipsets and atomically swapping out the IP addresses in the sets so that Kilo does not have to change rules so often. This means that whenever new Nodes or Peers join a cluster, this would instead require an "instantaneous" ipset operation rather than changing iptables rules. This would bring other performance benefits with it as the evaluation of a firewall rule containing ipset is faster than evaluating many independent iptables rules.

squat avatar Jul 26 '22 14:07 squat

In general I think this is an important priority for Kilo. Another way I've considered solving this in the past is by using ipsets and atomically swapping out the IP addresses in the sets so that Kilo does not have to change rules so often. This means that whenever new Nodes or Peers join a cluster, this would instead require an "instantaneous" ipset operation rather than changing iptables rules. This would bring other performance benefits with it as the evaluation of a firewall rule containing ipset is faster than evaluating many independent iptables rules.

I do really like the sound of that approach. However, how would you feel about us starting with the implementation in this PR, so that our immediate problem can be solved (we verified that it does indeed solve our problems on one of our test clusters 😁 ), and then have a stab at doing a PR using ipsets at some point later? 😬

clive-jevons avatar Jul 26 '22 15:07 clive-jevons

The idea of using ipsets sounds great! I think this PR here might even be a nice step towards that direction by introducing the distinction between the two iptables rules classes:

  • the append rules seem to be the catch-all kind of rules that have to be in place statically and I don't think ipsets make much sense here?
  • the prepend rules seem to be the ip-/subnet-specific ones that would benefit from a move to ipsets

The PR at hand already introduces a carrier datastructure RuleSet to carry more semantic information than a []Rule slice - which sounds like a great start to later put more semantics in for ipsets - while in this iteration still relying on the proven approach of using iptables.

So all in all I think this PR could even be considered as a nice evolutionary step towards an ipsets-based implementation.

WDYT?

dajudge avatar Jul 27 '22 09:07 dajudge

We just ran a small test against a cluster running this PR (plus the metrics PR for observability) and this is what it looks like in grafana :heart_eyes:

image

The two iptables related charts display the following metrics:

  • sum(rate(kilo_iptables_operations_total[5m])) by (operation)
  • sum(rate(kilo_iptables_operations_total{operation!~"(List|Exists|ListChains)"}[5m])) by (chain)

We also didn't observe any disconnects during autoscaling as far as we can tell.

dajudge avatar Jul 28 '22 14:07 dajudge

@dajudge what's the status of this PR? Is it ready for a final review so we can merge and maybe include in a Kilo 0.6.0? :heart_eyes_cat:

squat avatar Oct 20 '22 10:10 squat

@dajudge what's the status of this PR? Is it ready for a final review so we can merge and maybe include in a Kilo 0.6.0? heart_eyes_cat

Yes. We've been running patched version of kilo with this PR included since July.

We're currently running it on over half a dozen kubernetes clusters with dozens of nodes in total. We have not encountered any issues and it's even playing nicely with kube-router ever since (we had severe problems with that before).

From our perspective this PR is ready to go! :heart_eyes: :rocket: :+1:

dajudge avatar Oct 20 '22 11:10 dajudge

@squat <- any chance of getting this merged and into a release? 😬 🙏 Anything we can do to help move this forward?

clive-jevons avatar Dec 20 '22 12:12 clive-jevons

@clive-jevons thanks for your patience! If you can help resolve the merge conflict on the PR then I would be very happy to merge this bad boy!

squat avatar Mar 28 '23 06:03 squat

rebased / merge conflicts resolved.

running it on a test cluster now.

looks good at first glance. :+1:

dajudge avatar Mar 28 '23 13:03 dajudge

thanks @dajudge

squat avatar Mar 28 '23 13:03 squat