cluster-api-provider-azure
CAPZ should use Out of Tree `cloud-controller-manager` and Storage Drivers
⚠️ Cluster API Azure maintainers can ask to turn an issue-proposal into a CAEP when necessary. This is to be expected for large changes that impact multiple components, breaking changes, or new large features.
Dependencies
- Cluster ResourceSet needs to be implemented to properly support this
Goals
- CAPZ clusters should be deployable using the OOT Azure Provider and Storage Drivers
- Default should be OOT
- Tests need to be updated to cover both modes and the migration scenario from in-tree to OOT
Non-Goals/Future Work
- Implement Cluster ResourceSet
User Story
As an operator, I would like to separate the cloud provider integration from the Kubernetes binaries and use the newer Storage Drivers and cloud-provider-azure.
Detailed Description
In 2018/2019 Kubernetes started to externalize interactions with the underlying cloud provider to slow down the growth in size of Kubernetes binaries and to decouple the lifecycle and development of Kubernetes from that of the individual cloud provider integrations. https://kubernetes.io/blog/2019/04/17/the-future-of-cloud-providers-in-kubernetes/
/kind proposal
@nader-ziada @devigned @CecileRobertMichon @ncdc
Have you already seen the doc and template? It might help to distinguish this issue from what's already possible by adding some additional details. ClusterResourceSet is one approach to automate this, but I see you've listed that as a non-goal (and dependency)?
The 2nd goal "Default should be OOT" is something we're not necessarily ready for. I think for now we want to support optionally using OOT (without any manual steps, possibly using ClusterResourceSet), but I don't think we'll want to move this to be the default right away, to align with other Azure provisioning tools. cc @feiskyer @ritazh
See https://github.com/kubernetes/enhancements/issues/667 for current Azure OOT provider status
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/priority important-longterm
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community.
/close
@fejta-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-contributor-experience at kubernetes/community. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lifecycle frozen
status update:
- Cluster ResourceSet needs to be implemented to properly support this: done
- CAPZ clusters should be deployable using OOT Azure Provider and Storage Drivers: done
- Default should be OOT: hold until OOT is fully ready
- Tests need to be updated to test both modes and the migration scenario from in-tree to OOT: added tests for OOT (already testing in-tree); not testing migration currently
- Implement Cluster ResourceSet: done in #1216
> Default should be OOT

Now that v1.0.0 has been released, we should be able to move forward with this.
/assign
/unassign /assign @jackfrancis
cc @sonasingh46
I have been trying to validate this manually, especially around the Kubernetes 1.22 --> 1.23 upgrade paths. The following in-tree components for Azure are the points of attention:
- Azure Disk CSI driver
- Azure File CSI driver
- cloud-provider-azure
As part of the effort to extract cloud provider dependencies from Kubernetes, cloud-provider-specific code is moving out of tree. As a result, the in-tree storage drivers and cloud providers are being removed from the Kubernetes code base.
Starting with Kubernetes 1.23, Azure Disk CSI migration is enabled by default. This means that to provision a volume through the Azure Disk driver, the external Azure Disk CSI driver must be installed; the in-tree Azure Disk driver no longer works in 1.23 because the migration is on by default.
The in-tree Azure File driver continues to work in 1.23, as Azure File CSI migration is not enabled by default there. If Azure File CSI migration is enabled by a user/admin, the external Azure File CSI driver needs to be installed.
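To make the provisioner difference concrete, here is a sketch of the two StorageClass variants involved (names and parameters are illustrative, not taken from any CAPZ template). With migration enabled, claims against the in-tree provisioner are translated to the CSI driver, so the external driver must be present in both cases:

```yaml
# In-tree provisioner; with CSIMigrationAzureDisk enabled, claims against this
# class are served by the external driver through the migration translation layer.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-intree   # illustrative name
provisioner: kubernetes.io/azure-disk
parameters:
  storageaccounttype: StandardSSD_LRS
---
# External Azure Disk CSI driver provisioner; requires the azuredisk-csi-driver
# to be installed in the workload cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi   # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
```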
Consider the following upgrade paths from v1.22 to v1.23:
Scenario 1: Upgrade the cluster from Kubernetes 1.22 to 1.23 without any extra tuning or configuration
- Azure Disk CSI migration is enabled by default on the upgraded cluster.
- The external Azure Disk CSI driver must be installed so that pods using existing volumes from the in-tree Azure Disk driver on 1.22 continue to work on the upgraded cluster.
- To create new Azure Disk volumes, the external Azure Disk CSI driver must be installed. One way of installing it is via a ClusterResourceSet (CRS); see the sketch after this scenario.
- Azure File CSI migration is disabled by default.
- Existing volumes created by the in-tree Azure File driver on 1.22 continue to work on the upgraded cluster.
- New Azure File volumes can be created without installing any external driver.
- The in-tree CCM is enabled by default.
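For reference, a ClusterResourceSet applies manifests stored in ConfigMaps or Secrets to workload clusters selected by label. A minimal sketch of installing the external Azure Disk CSI driver this way; the label and ConfigMap names are hypothetical, and the apiVersion may differ by CAPI release:

```yaml
apiVersion: addons.cluster.x-k8s.io/v1beta1   # older CAPI releases use v1alpha3/v1alpha4
kind: ClusterResourceSet
metadata:
  name: azuredisk-csi-driver
  namespace: default
spec:
  # Applied to every workload cluster carrying this (hypothetical) label.
  clusterSelector:
    matchLabels:
      azuredisk-csi: external
  resources:
    # Hypothetical ConfigMap in the same namespace containing the raw
    # azuredisk-csi-driver manifests; Secrets are also supported.
    - name: azuredisk-csi-driver-manifests
      kind: ConfigMap
```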
Scenario 2: Upgrade the cluster from Kubernetes 1.22 to 1.23 with Azure Disk CSI migration disabled (see the feature-gate sketch after this scenario)
- Azure Disk CSI migration is disabled on the upgraded cluster.
- Existing volumes created by the in-tree Azure Disk driver on 1.22 continue to work on the upgraded cluster via the in-tree driver.
- New Azure Disk volumes can be created without installing any external driver.
- Azure File CSI migration is disabled by default.
- Existing volumes created by the in-tree Azure File driver on 1.22 continue to work on the upgraded cluster via the in-tree driver.
- New Azure File volumes can be created without installing any external driver.
- The in-tree CCM is enabled by default.
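A minimal sketch, assuming a KubeadmControlPlane-based control plane, of what pinning the migration off could look like (only the relevant fields are shown; the resource name is hypothetical, and worker nodes would need the same kubelet feature gate in their KubeadmConfigTemplate):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # hypothetical
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          # Keep the in-tree Azure Disk volume plugin active in the controllers.
          feature-gates: CSIMigrationAzureDisk=false
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: CSIMigrationAzureDisk=false
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: CSIMigrationAzureDisk=false
```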
PS: Still validating other scenarios
Scenario 3: Upgrade the cluster from Kubernetes 1.22 to 1.23 with the external cloud provider enabled
- The upgrade failed: the new control plane machine did not pass the preflight checks. Readiness and startup probes failed for the control plane components on the new control plane machine that came up.
- To fix this, we may need to enable the external volume plugin. (WIP)
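For reference, a minimal sketch (only the relevant fields, hypothetical resource name) of the kubeadm settings involved when switching to the external cloud provider. The out-of-tree cloud-provider-azure components must also be installed in the workload cluster (for example via a ClusterResourceSet), otherwise new nodes keep the node.cloudprovider.kubernetes.io/uninitialized taint and never become Ready; the flavor templates in this repository cover the complete set of changes:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # hypothetical
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          # Disable the in-tree cloud controller loops; the external CCM takes over.
          cloud-provider: external
          # If in-tree volume plugins must keep working while --cloud-provider=external
          # is set, kube-controller-manager's --external-cloud-volume-plugin flag is
          # the likely candidate for the "external volume plugin" mentioned above:
          # external-cloud-volume-plugin: azure
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
```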
@jackfrancis and @Jont828, is this something that should land in milestone v1.5, or will it probably hit the next one?
I'm not too sure; is there a PR open or being worked on for this at the moment? It looks like Jack was assigned to it, so maybe we can ask him when he's back.
I think we can land this in the next milestone
/milestone next
/assign /milestone v1.8