security groups: Do not use conntrack when it is not required
Description
This PR changes the behavior of Security Groups to disable connection tracking when it is not needed. The idea is that VMs with an "allow all" rule can have as many connections as they want without straining the host system. This change may be beneficial for VPS hosters, where the VM behavior is not under the control of the server administrator.
The list of changes:
Introduced two new ipsets, `cs_notrack` for IPv4 and `cs_notrack6` for IPv6, that contain the VM IP addresses that do not need to be tracked.
When a security group contains a rule allowing all protocols from 0.0.0.0/0 (IPv4) or ::/0 (IPv6), all the IPv4 and/or IPv6 addresses of the VM are added to these ipsets.
The following rules are added to the `PREROUTING` chain of the iptables `raw` table:
```
iptables -t raw -A PREROUTING -m set --match-set cs_notrack dst -j NOTRACK
iptables -t raw -A PREROUTING -m set --match-set cs_notrack src -j NOTRACK
ip6tables -t raw -A PREROUTING -m set --match-set cs_notrack6 dst -j NOTRACK
ip6tables -t raw -A PREROUTING -m set --match-set cs_notrack6 src -j NOTRACK
```
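For illustration only, here is a minimal Python sketch of how the notrack sets could be maintained. The helper name `update_notrack_ipsets` and its parameters are hypothetical and not taken from the patch; only the `ipset` usage and the set names match the description above:

```python
import subprocess

def update_notrack_ipsets(ipv4_addrs, ipv6_addrs, allow_all_v4, allow_all_v6):
    """Add the VM addresses to cs_notrack/cs_notrack6 so that the raw-table
    NOTRACK rules above skip conntrack for this VM's traffic."""
    if allow_all_v4:
        for ip in ipv4_addrs:
            # '-!' keeps the call idempotent: re-adding an existing entry is not an error
            subprocess.call(['ipset', '-!', 'add', 'cs_notrack', ip])
    if allow_all_v6:
        for ip in ipv6_addrs:
            subprocess.call(['ipset', '-!', 'add', 'cs_notrack6', ip])
```

When the allow-all rule disappears, the addresses would presumably be removed from the sets again (e.g. with `ipset -! del`), so only VMs that currently have an allow-all rule bypass conntrack.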
The iptables matchers `-m state --state NEW` are removed, as they are not needed for several reasons:
- they block the allowed traffic if the connection is not tracked
- the remaining matchers in each rule are explicit enough to allow the traffic specified in the security group
- conntrack lookups can be very expensive at high packet-per-second rates when the connection tracking table has tens of millions of records
The `-m state --state ESTABLISHED,RELATED` rules are now placed only at the end of the VM `-def` chain, as a last-resort rule before the final decision to drop the packet. The goal is to rely on explicit matchers as much as possible.
The behavior of the `-VM` chain that contains user-defined rules was modified:
- the rules do not return traffic; the only possible rule action is ACCEPT. If a packet doesn't match any rule, it returns to the `-def` chain, where it is checked for belonging to an existing connection and dropped otherwise;
- the above-mentioned `-m state --state NEW` matchers are removed.
Since the VM `-def` chain is populated with rules for each NIC, and there is no place in the code to inject a final unconditional `-j DROP`, I had to resort to blocking traffic matching each VM network interface at the end of each set of interface-specific rules.
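To make the resulting layout easier to picture, here is a hypothetical sketch of the rule ordering for a VM with one NIC and a single explicit TCP rule. The chain names (`i-2-10-VM`, `i-2-10-VM-def`), the interface name (`vnet3`), the physdev match and the TCP/22 rule are made up for illustration; only the ordering reflects the description above:

```python
# Illustrative only: rule strings such a VM might end up with.
VM_CHAIN = [
    # user-defined rules: explicit ACCEPTs only, no '-m state --state NEW'
    "-A i-2-10-VM -p tcp --dport 22 -j ACCEPT",
    # no other targets here; a packet that matches nothing falls back to the -def chain
]

DEF_CHAIN_TAIL = [
    # last-resort rule: traffic of already-tracked connections is still accepted
    "-A i-2-10-VM-def -m state --state ESTABLISHED,RELATED -j ACCEPT",
    # per-interface drop at the end of each set of interface-specific rules,
    # standing in for a single final unconditional '-j DROP'
    "-A i-2-10-VM-def -m physdev --physdev-is-bridged --physdev-in vnet3 -j DROP",
    "-A i-2-10-VM-def -m physdev --physdev-is-bridged --physdev-out vnet3 -j DROP",
]
```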
A minor refactoring is done:
- The function `split_ips_by_family()` now takes one or more arguments that can be either a `;`-separated string or any other type that can be parsed by the Python `ipaddress.ip_address()` method. The function splits `;`-separated strings when it encounters them and removes the empty elements and `'0'` literals (they indicate an empty IP address list for some reason). As a result, it returns a tuple containing a list of IPv4 addresses and a list of IPv6 addresses, so the function is backwards compatible with the previous behavior (a sketch of both helpers follows this list).
- Some lines of code that duplicated the functionality of the updated `split_ips_by_family()` are removed.
- The function `add_to_ipset()` uses the `-!` flag, which silently ignores the addition of an element that already exists in the ipset, or its removal if it doesn't exist. It will still crash if the requested ipset does not exist. This change makes the `ipset add` calls idempotent.
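Below is a minimal sketch of the two helpers as described above, assuming subprocess-based execution; the parameter names (`args`, `setname`, `ips`, `action`) and exact signatures are assumptions, not necessarily what the patch uses:

```python
import ipaddress
import subprocess


def split_ips_by_family(*args):
    """Return a tuple (ipv4_list, ipv6_list) built from the arguments.

    Each argument is either a ';'-separated string or any value that
    ipaddress.ip_address() can parse; empty elements and the literal '0'
    (which marks an empty address list) are dropped.
    """
    ipv4, ipv6 = [], []
    for arg in args:
        items = arg.split(';') if isinstance(arg, str) else [arg]
        for item in items:
            if item in ('', '0'):
                continue
            addr = ipaddress.ip_address(item)
            (ipv4 if addr.version == 4 else ipv6).append(str(addr))
    return ipv4, ipv6


def add_to_ipset(setname, ips, action='add'):
    """Add or delete entries idempotently; '-!' silently ignores adding an
    existing entry or deleting a missing one, but the command still fails
    if the ipset itself does not exist."""
    for ip in ips:
        subprocess.check_call(['ipset', '-!', action, setname, ip])
```

For example, `split_ips_by_family("10.1.1.5;0;2001:db8::5", "10.1.1.6")` would return `(['10.1.1.5', '10.1.1.6'], ['2001:db8::5'])`.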
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
- [ ] test (unit or integration test code)
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [X] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [ ] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
Tested:
- starting a VM
- changing the rules on the fly: adding/removing the "allow all" rule, adding more specific rules such as allowing a specific TCP port range.
- migrating the VM to a host with these changes deployed
- migrating the VM from the host with these changes deployed to a host with "vanilla" security groups script, and back
- both ingress and egress security group behavior was tested.
How did you try to break this feature and the system with this change?
- Test with only egress traffic allowed (no ingress rules)
- Test with only ingress traffic allowed (only one egress rule allowing traffic to a non-existing IP address, which causes all other egress traffic to be dropped)
- Test with more-specific rules, e.g. allow specific ports, or allow only IPv6 traffic
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 16.00%. Comparing base (653b973) to head (f1ff535).
:warning: Report is 804 commits behind head on main.
Additional details and impacted files
```
@@             Coverage Diff             @@
##               main   #10594     +/-   ##
============================================
- Coverage     16.00%   16.00%   -0.01%
- Complexity    13104    13105       +1
============================================
  Files          5651     5651
  Lines        495870   495870
  Branches      60049    60049
============================================
- Hits          79370    79361       -9
- Misses       407638   407652      +14
+ Partials       8862     8857       -5
```
| Flag | Coverage Δ | |
|---|---|---|
| uitests | 4.00% <ø> (ø) | |
| unittests | 16.84% <ø> (-0.01%) | :arrow_down: |
Well, I tested it somewhat extensively: single IPv4, single IPv4+IPv6, IPv4+IPv6 with additional IPv4 and IPv6 IPs, multiple NICs, different IPv4+IPv6 rule combinations. Looks good to me.
However, I wouldn't trust me on this entirely; it is quite hard not to make a mistake with such a complicated script. I could have missed something, so it would be great if someone could test this extensively too.
cc @loth @kriegsmanj
Moving to 4.23 milestone cc @harikrishna-patnala @DaanHoogland
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✖️ debian ✔️ suse15. SL-JID 16023
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian Build Failed (tid-14977)
I see the test failed here. Is there a way for me to see the details?
This is not a problem with the tests but a capacity problem in the backend lab. I restarted the test job.