
✨ proposal: new API for managed Security Groups

EmilienM opened this pull request 2 years ago • 18 comments

What this PR does / why we need it:

KEP to enhance Security Group API.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Related #1752

EmilienM avatar Nov 22 '23 22:11 EmilienM

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Latest commit: 3992c240eef52d992b0a6b8034df54e49d3dedec
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65dc8fdba06ca3000843cf27
Deploy Preview: https://deploy-preview-1756--kubernetes-sigs-cluster-api-openstack.netlify.app/print

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Nov 22 '23 22:11 netlify[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: EmilienM. Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Nov 22 '23 22:11 k8s-ci-robot

/cc mdbooth

EmilienM avatar Nov 22 '23 22:11 EmilienM

/cc jichenjc

EmilienM avatar Nov 22 '23 22:11 EmilienM

@mkjpryor I'd like your feedback on this one when time permits. Thanks!

EmilienM avatar Nov 28 '23 16:11 EmilienM

@EmilienM

Do you think that this should be consistent with whatever we do for server groups? Personally, I favour new CRDs for both, i.e. OpenStackSecurityGroup and OpenStackServerGroup CRDs.

I worry about us cluttering up the spec of the OpenStackCluster with too much complicated stuff like this. IMHO we should start to establish a pattern of having additional CRDs for resources that don't have a direct correspondence with a cluster/machine.

For example, a port is something that I don't mind having in OpenStackMachine.spec, as it is clearly related to the machine.

mkjpryor avatar Nov 28 '23 17:11 mkjpryor

> @EmilienM
>
> Do you think that this should be consistent with whatever we do for server groups? Personally, I favour new CRDs for both, i.e. OpenStackSecurityGroup and OpenStackServerGroup CRDs.
>
> I worry about us cluttering up the spec of the OpenStackCluster with too much complicated stuff like this. IMHO we should start to establish a pattern of having additional CRDs for resources that don't have a direct correspondence with a cluster/machine.
>
> For example, a port is something that I don't mind having in OpenStackMachine.spec, as it is clearly related to the machine.

To me there is a relation between Security Groups and Machines, as we want to update a Machine if a security group is added to the control plane machines. For example: if we move the Calico rules into their own Security Group, we'll need a migration path where we remove the rules from the control plane security group, then create a new group with the Calico rules and add that group to the machines. This requires the controller to update the machine with the new SG. Do we want to do that in the OpenStackMachine (or Cluster) controller, or via a new controller dedicated to security groups that can also update a Machine?

EmilienM avatar Nov 29 '23 01:11 EmilienM

> There is an important design decision to be made here: how do we deal with users messing with SGs managed by CAPO through the OpenStack API?
>
> My idea would be to adopt the K8s model here. The controller should periodically monitor the rules and associations of the SGs it manages, and reconcile them if a user made any changes. I think this is the only sane way to do it; otherwise we allow users to modify SGs on their own, and we never know whether an attached SG should be kept or removed during e.g. an upgrade.

I think we all agree; the question remains whether we add a new CRD (+ controller) dedicated to the Security Group lifecycle or continue to use OpenStackCluster.
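
A minimal sketch of the reconciliation model quoted above, using hypothetical types and helpers rather than the actual CAPO controller code:

// Hypothetical sketch only; not the real CAPO controller.
package sketch

import "reflect"

// Rule is a simplified security group rule.
type Rule struct {
	Direction    string
	EtherType    string
	PortRangeMin int
	PortRangeMax int
	Protocol     string
}

// reconcileSecurityGroup compares the desired rules from the spec against the
// rules currently observed in Neutron and re-applies the spec on any drift,
// reverting changes a user made out of band through the OpenStack API.
func reconcileSecurityGroup(desired, observed []Rule, apply func([]Rule) error) error {
	if reflect.DeepEqual(desired, observed) {
		return nil // no drift, nothing to do
	}
	// Desired state wins: rules added out of band are dropped, and rules
	// the user deleted are restored on the next reconcile pass.
	return apply(desired)
}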

EmilienM avatar Nov 29 '23 01:11 EmilienM

It just feels odd to me that all the security groups are defined in the OpenStackCluster object and the OpenStackMachine object says "I want to use security group X from the cluster object", vs having a new OpenStackSecurityGroup CRD 🤷‍♂️ Same for server groups when we do that thing.

I do agree that it probably makes the conversion more complicated.

Happy to be overruled though.

mkjpryor avatar Nov 29 '23 11:11 mkjpryor

> It just feels odd to me that all the security groups are defined in the OpenStackCluster object and the OpenStackMachine object says "I want to use security group X from the cluster object", vs having a new OpenStackSecurityGroup CRD 🤷‍♂️ Same for server groups when we do that thing.

that's exactly what I asked above, and I think maybe we can check the comments from @EmilienM:

> For now, let's try without a new CRD and see how it goes. I'll keep this comment open as it's still relevant in the last KEP's iteration.

so maybe we can check https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1765/files first

jichenjc avatar Dec 01 '23 01:12 jichenjc

As discussed at the last office hours, I think this is the biggest single API stabilisation task we have for v1beta1. Specifically, we need to remove the hard-coded Calico rules, but allow:

  • CNI rules to be used in managed security groups
  • Safely upgrading clusters which are currently using the hard-coded Calico rules

I would very much like to get to v1beta1 asap, so to make that simpler I'd like to propose reducing the scope of the change here as much as possible.

managedSecurityGroups:
  # Enable the management of security groups with the default rules (kubelet, etcd, etc.)
  # The default stays false
  enabled: true
  # Allow to extend the default security group rules for the Bastion,
  # the Control Plane and the Worker security groups
  additionalBastionSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  additionalControlPlaneSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  additionalWorkerSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp
  # Allow to provide rules that will be applied to all nodes
  allNodesSecurityGroupRules:
    - direction: ingress
      ethertype: IPv4
      portRangeMin: 1234
      portRangeMax: 1234
      protocol: tcp

I would remove additionalSecurityGroups for now for 2 reasons. Firstly, in this proposal they are nested under managedSecurityGroups, but they're not managed. I think it would be confusing to write:

managedSecurityGroups:
  enabled: false
  additionalSecurityGroups:
  - ...

Does the user expect these to be created when managedSecurityGroups is disabled? additionalSecurityGroups are unmanaged, so I'd have them separate.

Secondly, adding a new additionalSecurityGroups field would be a backwards-compatible change, so we can do it in v1beta1. Let's punt on anything we don't strictly need right now.

We discussed having a useLegacyCalicoRules field, but it occurs to me we don't need it: we can just add the Calico rules directly to allNodesSecurityGroupRules during upgrade. This means that the legacy cruft only lives in the legacy API code. We can document what the rules are, and others can add to the documentation the required rules for other CNIs.
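
As a rough sketch of that idea (names are hypothetical, not the actual conversion code), the up-conversion would simply append the legacy rules, whose values match the worked example further below:

// Hypothetical sketch only: the legacy Calico rules live solely in the legacy
// API conversion code and are appended to allNodesSecurityGroupRules during
// up-conversion. Names are illustrative, not the actual CAPO conversion API.
package sketch

type SecurityGroupRule struct {
	Description         string
	Direction           string
	EtherType           string
	PortRangeMin        int
	PortRangeMax        int
	Protocol            string
	RemoteManagedGroups []string
}

// legacyCalicoRules reproduces the rules previously hard-coded in the
// controller (the same rules appear as YAML in the worked example below).
var legacyCalicoRules = []SecurityGroupRule{
	{
		Description:         "BGP (calico)",
		Direction:           "ingress",
		EtherType:           "IPv4",
		PortRangeMin:        179,
		PortRangeMax:        179,
		Protocol:            "tcp",
		RemoteManagedGroups: []string{"control-plane", "worker"},
	},
	{
		Description:         "IP-in-IP (calico)",
		Direction:           "ingress",
		EtherType:           "IPv4",
		Protocol:            "4",
		RemoteManagedGroups: []string{"control-plane", "worker"},
	},
}

// convertAllNodesRules appends the legacy rules during up-conversion, so old
// clusters keep their Calico rules without a useLegacyCalicoRules field.
func convertAllNodesRules(allNodesRules []SecurityGroupRule) []SecurityGroupRule {
	return append(allNodesRules, legacyCalicoRules...)
}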

In the first instance, we can also remove additional controller complexity by not changing the set of security groups attached to nodes. This means we don't have an upgrade problem, and don't need to be able to dynamically attach and detach security groups. Instead, at least in the first instance, we can simply replicate rules in allNodesSecurityGroupRules to control-plane and workers.

Lastly, we make extensive use of RemoteGroupID in our current rules. We need to allow that to be expressed in the new API, but it obviously can't take the UUID of a managed security group. I suggest that in the first instance we add a remoteManagedGroups field to security group rules which references one of bastion, control-plane, or worker. As a fully worked example, I would expect an upgraded configuration containing upgraded Calico rules to be:

managedSecurityGroups:
  enabled: true
  allNodesSecurityGroupRules:
  - description: BGP (calico)
    direction: ingress
    ethertype: IPv4
    portRangeMin: 179
    portRangeMax: 179
    protocol: tcp
    remoteManagedGroups:
    - control-plane
    - worker
  - description: IP-in-IP (calico)
    direction: ingress
    ethertype: IPv4
    protocol: 4
    remoteManagedGroups:
    - control-plane
    - worker

These 2 rules would become 4 rules, as we would create a duplicate for each of the remoteManagedGroups. The resulting 4 rules would be added to both control-plane and worker.
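
A minimal sketch of that expansion, with hypothetical names standing in for whatever the controller actually does:

// Hypothetical sketch of expanding remoteManagedGroups into concrete Neutron
// rules; names are illustrative, not the actual CAPO implementation.
package sketch

type SpecRule struct {
	Description         string
	RemoteManagedGroups []string // "bastion", "control-plane" or "worker"
}

type NeutronRule struct {
	Description   string
	RemoteGroupID string // UUID of the resolved managed security group
}

// expandRules emits one Neutron rule per remoteManagedGroups entry, resolving
// each symbolic name to the UUID of the corresponding managed group. The two
// Calico rules above therefore expand to four Neutron rules, and that set is
// then attached to both the control-plane and the worker security groups.
func expandRules(rules []SpecRule, groupIDs map[string]string) []NeutronRule {
	var out []NeutronRule
	for _, r := range rules {
		for _, g := range r.RemoteManagedGroups {
			out = append(out, NeutronRule{
				Description:   r.Description,
				RemoteGroupID: groupIDs[g],
			})
		}
	}
	return out
}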

I suggest that this change alone could also be the minimum change, and we could add additional<X>SecurityGroupRules later as a backwards-compatible change.

mdbooth avatar Jan 15 '24 17:01 mdbooth

Thank you for that thought through comment! I completely agree!

lentzi90 avatar Jan 16 '24 09:01 lentzi90

Cool, I'm working on a change here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1826 to address what Matt has asked for.

EmilienM avatar Jan 16 '24 20:01 EmilienM

I'll just note one thing here - in complicated envs Neutron has performance issues with remote_group_id as it generates a lot of flows. It's advised to use CIDRs directly if possible. This probably isn't a problem for CAPO scale, but I'm mentioning it here for the record.

dulek avatar Jan 19 '24 14:01 dulek

Note: this one is not stale; I'll come back to it at some point this year, now that we have a new API for Security Groups.

EmilienM avatar Feb 21 '24 16:02 EmilienM

This is still highly WIP; I need to update it based on our recent work. Please don't review it yet.

EmilienM avatar Feb 23 '24 00:02 EmilienM

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 26 '24 13:05 k8s-triage-robot

/remove-lifecycle stale

EmilienM avatar May 27 '24 12:05 EmilienM

/close This has been inactive for a very long time and I don't see myself working on that in the coming months unless something changes. Until then I'll close this PR.

EmilienM avatar Aug 20 '24 18:08 EmilienM

@EmilienM: Closed this PR.

In response to this:

> /close This has been inactive for a very long time and I don't see myself working on that in the coming months unless something changes. Until then I'll close this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 20 '24 18:08 k8s-ci-robot