cluster-api-provider-openstack
✨ proposal: new API for managed Security Groups
What this PR does / why we need it:
KEP to enhance Security Group API.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related #1752
Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!
| Name | Link |
|---|---|
| Latest commit | 3992c240eef52d992b0a6b8034df54e49d3dedec |
| Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65dc8fdba06ca3000843cf27 |
| Deploy Preview | https://deploy-preview-1756--kubernetes-sigs-cluster-api-openstack.netlify.app/print |
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: EmilienM. Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/cc mdbooth
/cc jichenjc
@mkjpryor I'd like your feedback on this one when time permits. Thanks!
@EmilienM
Do you think that this should be consistent with whatever we do for server groups? Personally, I favour new CRDs for both, i.e. OpenStackSecurityGroup and OpenStackServerGroup CRDs.
I worry about us cluttering up the spec of the OpenStackCluster with too much complicated stuff like this. IMHO we should start to establish a pattern of having additional CRDs for resources that don't have a direct correspondence with a cluster/machine.
For example, a port is something that I don't mind having in OpenStackMachine.spec, as it is clearly related to the machine.
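For illustration only, here is a rough sketch of what such a standalone CRD and a machine-side reference could look like. The kind, apiVersion and the securityGroupRefs field below are hypothetical, not part of any existing API or of this proposal:

```yaml
# Hypothetical standalone CRD (names and version are illustrative only)
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: OpenStackSecurityGroup
metadata:
  name: cni-rules
spec:
  rules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 179
      portRangeMax: 179
      protocol: tcp
---
# An OpenStackMachine could then reference the group by name
# (securityGroupRefs is a hypothetical field used here only to show the shape)
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: OpenStackMachine
metadata:
  name: control-plane-0
spec:
  securityGroupRefs:
    - name: cni-rules
```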
@EmilienM
Do you think that this should be consistent with whatever we do for server groups? Personally, I favour new CRDs for both, i.e. OpenStackSecurityGroup and OpenStackServerGroup CRDs. I worry about us cluttering up the spec of the OpenStackCluster with too much complicated stuff like this. IMHO we should start to establish a pattern of having additional CRDs for resources that don't have a direct correspondence with a cluster/machine. For example, a port is something that I don't mind having in OpenStackMachine.spec, as it is clearly related to the machine.
To me there is a relation between Security Groups and Machines, as we want to update a Machine if a security group is added to the control plane machines. For example: if we move the Calico rules into their own Security Group, we'll need a migration path where we remove the rules from the control plane security group, create a new group with the Calico rules, and add that group to the machines. This requires the controller to update the machine with the new SG. Do we want to do that in the OpenStackMachine (or Cluster) controller, or via a new controller dedicated to security groups that can also update a Machine?
There is an important design decision to be made here: how do we deal with users modifying SGs managed by CAPO through the OpenStack API?
My idea would be to adopt the K8s model here. The controller should periodically monitor the rules and associations of the SGs it's managing and reconcile them if the user made any changes. I think this is the only sane way to do it; otherwise we allow users to modify SGs on their own, and we never know whether an attached SG should be kept or removed during e.g. an upgrade.
I think we all agree; the question remains whether we have a new CRD (+ controller) dedicated to Security Groups lifecycle or continue to use OpenStackCluster.
It just feels odd to me that all the security groups are defined in the OpenStackCluster object and the OpenStackMachine object says "I want to use security group X from the cluster object", vs having a new OpenStackSecurityGroup CRD 🤷♂️ Same for server groups when we do that thing.
I do agree that it probably makes the conversion more complicated.
Happy to be overruled though.
It just feels odd to me that all the security groups are defined in the OpenStackCluster object and the OpenStackMachine object says "I want to use security group X from the cluster object", vs having a new OpenStackSecurityGroup CRD 🤷♂️ Same for server groups when we do that thing.
that's exactly what I asked above, and I think maybe we can check the comments from @EmilienM
For now, let's try without a new CRD and see how it goes. I'll keep this comment open as it's still relevant in the last KEP's iteration.
so maybe we can check https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1765/files first
As discussed at the last office hours, I think this is the biggest single API stabilisation task we have for v1beta1. Specifically, we need to remove the hard-coded Calico rules, but allow:
- CNI rules to be used in managed security groups
- Clusters which are currently using the hard-coded Calico rules to be safely upgraded
I would very much like to get to v1beta1 asap, so to make that simpler I'd like to propose reducing the scope of the change here as much as possible.
```yaml
managedSecurityGroups:
  # Enable the management of security groups with the default rules (kubelet, etcd, etc.)
  # The default stays false
  enabled: true
  # Allow extending the default security group rules for the Bastion,
  # the Control Plane and the Worker security groups
  additionalBastionSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  additionalControlPlaneSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  additionalWorkerSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  # Allow providing rules that will be applied to all nodes
  allNodesSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
```
I would remove additionalSecurityGroups for now, for 2 reasons. Firstly, in this proposal they are nested under managedSecurityGroups, but they're not managed. I think it would be confusing to write:
```yaml
managedSecurityGroups:
  enabled: false
  additionalSecurityGroups:
    - ...
```
Does the user expect these to be created when managed security groups is disabled? additionalSecurityGroups are unmanaged, so I'd have them separate.
Secondly, adding a new additionalSecurityGroups field would be a backwards-compatible change, so we can do it in v1beta1. Let's punt on anything we don't strictly need right now.
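If we do add it later, one less ambiguous shape (purely illustrative, not part of the reduced proposal) would be to keep it as a sibling of managedSecurityGroups rather than a child, so disabling managed groups clearly has no effect on it:

```yaml
managedSecurityGroups:
  enabled: false
# Unmanaged, pre-existing groups referenced by ID, as a sibling field
# (the placement and the placeholder ID below are illustrative only)
additionalSecurityGroups:
  - id: <existing-security-group-uuid>
```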
We discussed having a useLegacyCalicoRules field, but it occurs to me we don't need it: we can just add the Calico rules directly to allNodesSecurityGroupRules during upgrade. This means that the legacy cruft only lives in the legacy API code. We can document what the rules are, and others can add to the documentation the required rules for other CNIs.
In the first instance, we can also remove additional controller complexity by not changing the set of security groups attached to nodes. This means we don't have an upgrade problem, and don't need to be able to dynamically attach and detach security groups. Instead, at least in the first instance, we can simply replicate rules in allNodesSecurityGroupRules to control-plane and workers.
Lastly, we make extensive use of RemoteGroupID in our current rules. We need to allow that to be expressed in the new API, but it obviously can't take the UUID of a managed security group. I suggest that in the first instance we add a remoteManagedGroups field to security group rules which references one of bastion, control-plane, or worker. As a fully worked example, I would expect an upgraded configuration containing upgraded Calico rules to be:
```yaml
managedSecurityGroups:
  enabled: true
  allNodesSecurityGroupRules:
    - description: BGP (calico)
      direction: ingress
      ethertype: IPv4
      portRangeMin: 179
      portRangeMax: 179
      protocol: tcp
      remoteManagedGroups:
        - control-plane
        - worker
    - description: IP-in-IP (calico)
      direction: ingress
      ethertype: IPv4
      protocol: 4
      remoteManagedGroups:
        - control-plane
        - worker
```
These 2 rules would become 4 rules, as we would create a duplicate for each of the remoteManagedGroups. The resulting 4 rules would be added to both control-plane and worker.
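Concretely, assuming CAPO resolves each remoteManagedGroups entry to the UUID of the corresponding managed security group, the expansion would look roughly like this (a sketch; the group IDs are placeholders):

```yaml
# Rules created in both the control-plane and worker security groups
# (<control-plane-sg-id> and <worker-sg-id> are placeholders for the managed group UUIDs)
- description: BGP (calico)
  direction: ingress
  ethertype: IPv4
  portRangeMin: 179
  portRangeMax: 179
  protocol: tcp
  remoteGroupID: <control-plane-sg-id>
- description: BGP (calico)
  direction: ingress
  ethertype: IPv4
  portRangeMin: 179
  portRangeMax: 179
  protocol: tcp
  remoteGroupID: <worker-sg-id>
- description: IP-in-IP (calico)
  direction: ingress
  ethertype: IPv4
  protocol: 4
  remoteGroupID: <control-plane-sg-id>
- description: IP-in-IP (calico)
  direction: ingress
  ethertype: IPv4
  protocol: 4
  remoteGroupID: <worker-sg-id>
```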
I suggest that this change alone could also be the minimum change, and we could add additional<X>SecurityGroupRules later as a backwards-compatible change.
Thank you for that well-thought-through comment! I completely agree!
Cool, I'm working on a change here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1826 to address what Matt has asked for.
I'll just note one thing here - in complicated envs Neutron has performance issues with remote_group_id as it generates a lot of flows. It's advised to use CIDRs directly if possible. This probably isn't a problem for CAPO scale, but I'm mentioning it here for the record.
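For example, a rule could target the nodes' subnet CIDR instead of a remote group (a sketch only; whether the proposed API exposes a remoteIPPrefix field mirroring Neutron's remote_ip_prefix is an assumption, and the CIDR is a placeholder):

```yaml
allNodesSecurityGroupRules:
  - description: BGP (calico)
    direction: ingress
    ethertype: IPv4
    portRangeMin: 179
    portRangeMax: 179
    protocol: tcp
    # CIDR-based alternative to remoteGroupID/remoteManagedGroups; value is a placeholder
    remoteIPPrefix: 10.0.0.0/24
```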
Note: this one is not stale, I'll come back to it at some point this year, now that we have a new API for Security Groups.
this is still highly WIP, I need to update it based on our recent work. Please don't review it yet.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/close This has been inactive for a very long time and I don't see myself working on that in the coming months unless something changes. Until then I'll close this PR.
@EmilienM: Closed this PR.
In response to this:
/close This has been inactive for a very long time and I don't see myself working on that in the coming months unless something changes. Until then I'll close this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.