# Lean iptables updates

## Why
We were repeatedly experiencing connectivity issues when scaling larger clusters up or down, and we believe this is related to the way Kilo applies changes to iptables when nodes are added or removed: in many cases that resulted in many rules (or even entire chains) being dropped and recreated during scaling.
## Idea

After taking a closer look at the iptables rules created by Kilo, we believe we have identified two different classes of rules:
- some rules need to be applied strictly in order and always at the end of a chain (e.g. the `FORWARD` to `KILO-IPIP` and the `DROP` rule in the `INPUT` chain, and also e.g. the `MASQUERADE` rule at the end of the `KILO-NAT` chain)
- other rules don't have a strict ordering requirement (apart from having to come before the rules that need to be at the end of a chain, where applicable)
## How
So we started preparing this patch, which puts the different rules into separate slices and
- first applies the order-sensitive/append-only rules using the existing mechanics
- then inserts each rule of the 2nd class at the beginning of its respective chain if it's not already there, regardless of order
- finally removes any rules of the 2nd class that are no longer required

Chain rules are treated as order-sensitive/append-only, which doesn't matter much beyond them being applied in the first wave of changes.
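To make this concrete, here's a rough Go sketch of the reconciliation flow described above; the `Rule`, `Backend`, and `reconcile` names and the backend interface are simplified stand-ins for illustration, not the actual types in this PR:

```go
package sketch

// Rule is a simplified stand-in for an iptables rule
// (table, chain, and the rule spec itself).
type Rule struct {
	Table string
	Chain string
	Spec  []string
}

// Backend abstracts the few iptables operations the reconciliation
// needs; in practice this would be backed by a real iptables client.
type Backend interface {
	Exists(r Rule) (bool, error)
	Append(r Rule) error      // append at the end of the chain
	InsertAtTop(r Rule) error // insert at position 1
	Delete(r Rule) error
}

// reconcile applies the two rule classes: order-sensitive rules are
// appended (once, in order), order-free rules are inserted at the top
// of their chain only when missing, and stale order-free rules are removed.
func reconcile(be Backend, appendRules, prependRules, staleRules []Rule) error {
	// 1st wave: order-sensitive / append-only rules.
	for _, r := range appendRules {
		exists, err := be.Exists(r)
		if err != nil {
			return err
		}
		if !exists {
			if err := be.Append(r); err != nil {
				return err
			}
		}
	}
	// 2nd wave: order-free rules, prepended only when missing.
	for _, r := range prependRules {
		exists, err := be.Exists(r)
		if err != nil {
			return err
		}
		if !exists {
			if err := be.InsertAtTop(r); err != nil {
				return err
			}
		}
	}
	// Finally: drop 2nd-class rules that are no longer required.
	for _, r := range staleRules {
		exists, err := be.Exists(r)
		if err != nil {
			return err
		}
		if exists {
			if err := be.Delete(r); err != nil {
				return err
			}
		}
	}
	return nil
}
```

The key property is that a rule which is already in place never gets touched, so existing traffic keeps matching the same rules throughout a scaling event.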
## Results
That way we were able to accomplish the following:
- the first class of rules only gets applied once, because they don't change after the first reconciliation (it's only the chains and the static rules at the end of some chains)
- for updates of the 2nd class of rules (they tend to be per IP/subnet), already existing rules never get dropped unless they actually need to be removed
We were able to confirm that this reduces the number of mutating iptables operations after the initial iptables configuration to the absolute minimum required.
In general I think this is an important priority for Kilo. Another way I've considered solving this in the past is by using ipsets and atomically swapping out the IP addresses in the sets so that Kilo does not have to change rules so often. This means that whenever new Nodes or Peers join a cluster, this would instead require an "instantaneous" ipset operation rather than changing iptables rules. This would bring other performance benefits with it as the evaluation of a firewall rule containing ipset is faster than evaluating many independent iptables rules.
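Roughly, the atomic swap I have in mind would look something like the sketch below, shelling out to the `ipset` CLI; the set name handling and the `hash:net` set type are just illustrative assumptions here:

```go
package sketch

import (
	"fmt"
	"os/exec"
)

// ipset shells out to the ipset CLI and surfaces its output on failure.
func ipset(args ...string) error {
	out, err := exec.Command("ipset", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ipset %v: %v: %s", args, err, out)
	}
	return nil
}

// swapSet rebuilds the named set from the given CIDRs by filling a
// temporary set and atomically swapping it in. The single iptables rule
// matching against setName never has to change; setName is assumed to
// already exist with the same type (hash:net).
func swapSet(setName string, cidrs []string) error {
	tmp := setName + "-tmp"
	// -exist makes create a no-op if a temporary set was left over.
	if err := ipset("-exist", "create", tmp, "hash:net"); err != nil {
		return err
	}
	if err := ipset("flush", tmp); err != nil {
		return err
	}
	for _, cidr := range cidrs {
		if err := ipset("add", tmp, cidr); err != nil {
			return err
		}
	}
	// swap exchanges the contents of the two sets atomically.
	if err := ipset("swap", tmp, setName); err != nil {
		return err
	}
	return ipset("destroy", tmp)
}
```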
I do really like the sound of that approach. However, how would you feel about us starting with the implementation in this PR, so that our immediate problem can be solved (we verified that it does indeed solve our problems on one of our test clusters 😁 ), and then have a stab at doing a PR using ipsets at some point later? 😬
The idea of using ipsets sounds great! I think this PR here might even be a nice step towards that direction by introducing the distinction between the two iptables rules classes:
- the append rules seem to be the catch-all kind of rules that have to be in place statically and I don't think ipsets make much sense here?
- the prepend rules seem to be the ip-/subnet-specific ones that would benefit from a move to ipsets
The PR at hand already introduces a carrier data structure `RuleSet` to carry more semantic information than a plain `[]Rule` slice, which sounds like a great starting point to later add more semantics for ipsets, while in this iteration still relying on the proven approach of using iptables.
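Just to illustrate the kind of split I mean, this is only a guess at the shape, not the actual definition in the PR:

```go
// Rough guess at what such a carrier could look like (Rule being an
// ordinary iptables rule representation); the real RuleSet in this PR
// may be structured differently.
type RuleSet struct {
	// AppendRules are order-sensitive and belong at the end of their
	// chains; they are applied once, in order.
	AppendRules []Rule
	// PrependRules are the order-free, per-IP/subnet rules; these would
	// be the natural candidates for a later move to ipsets.
	PrependRules []Rule
}
```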
So all in all I think this PR could even be considered as a nice evolutionary step towards an ipsets-based implementation.
WDYT?
We just ran a small test against a cluster running this PR (plus the metrics PR for observability) and this is what it looks like in Grafana :heart_eyes:
The two iptables-related charts display the following metrics:
- `sum(rate(kilo_iptables_operations_total[5m])) by (operation)`
- `sum(rate(kilo_iptables_operations_total{operation!~"(List|Exists|ListChains)"}[5m])) by (chain)`
We also didn't observe any disconnects during autoscaling as far as we can tell.
@dajudge what's the status of this PR? Is it ready for a final review so we can merge and maybe include in a Kilo 0.6.0? :heart_eyes_cat:
Yes. We've been running a patched version of Kilo with this PR included since July.
We're currently running it on over half a dozen Kubernetes clusters with dozens of nodes in total. We have not encountered any issues, and it has even been playing nicely with kube-router ever since (we had severe problems with that combination before).
From our perspective this PR is ready to go! :heart_eyes: :rocket: :+1:
@squat <- any chance of getting this merged and into a release? 😬 🙏 Anything we can do to help move this forward?
@clive-jevons thanks for your patience! If you can help resolve the merge conflict on the PR then I would be very happy to merge this bad boy!
rebased / merge conflicts resolved.
running it on a test cluster now.
looks good at first glance. :+1:
thanks @dajudge