
CAPZ should use Out of Tree `cloud-controller-manager` and Storage Drivers

Open jseely opened this issue 4 years ago • 21 comments

⚠️ Cluster API Azure maintainers can ask to turn an issue-proposal into a CAEP when necessary. This is to be expected for large changes that impact multiple components, breaking changes, or new large features.

Dependencies

  1. Cluster ResourceSet needs to be implemented to properly support this

Goals

  1. CAPZ clusters should be deployable using OOT Azure Provider and Storage Drivers
  2. Default should be OOT
  3. Tests need to be updated to test both modes and migration scenario from in-tree to OOT

Non-Goals/Future Work

  1. Implement Cluster ResourceSet

User Story

As an operator, I would like to separate the cloud provider integration from the Kubernetes binaries and use the newer Storage Drivers and cloud-provider-azure.

Detailed Description

In 2018/2019, Kubernetes started externalizing interactions with the underlying cloud providers, both to slow the growth of the Kubernetes binaries and to decouple the lifecycle and development of Kubernetes from that of the individual cloud provider integrations. See https://kubernetes.io/blog/2019/04/17/the-future-of-cloud-providers-in-kubernetes/

/kind proposal

jseely avatar Jun 18 '20 19:06 jseely

@nader-ziada @devigned @CecileRobertMichon @ncdc

jseely avatar Jun 18 '20 19:06 jseely

Have you already seen the doc and template? It might help to add some details distinguishing this issue from what's already possible. ClusterResourceSet is one approach to automate this, but I see you've listed that as a non-goal (and a dependency)?

alexeldeib avatar Jun 18 '20 19:06 alexeldeib

The 2nd goal, "Default should be OOT", is something we're not necessarily ready for. I think for now we want to support optionally using OOT (without any manual steps, possibly using ClusterResourceSet), but I don't think we'll want to make it the default right away, to align with other Azure provisioning tools. cc @feiskyer @ritazh

See https://github.com/kubernetes/enhancements/issues/667 for current Azure OOT provider status
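For context, opting into the OOT provider generally means setting `cloud-provider: external` on the kubelet and kube-controller-manager, and then deploying cloud-provider-azure into the cluster. A minimal sketch of the relevant KubeadmControlPlane fields (API version and field values are illustrative, not a tested CAPZ configuration):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane        # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          cloud-provider: external      # stop running the in-tree cloud controller loops
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external      # kubelet defers node initialization to the external CCM
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
```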

CecileRobertMichon avatar Jun 18 '20 19:06 CecileRobertMichon

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Oct 08 '20 17:10 fejta-bot

/remove-lifecycle stale

CecileRobertMichon avatar Nov 03 '20 23:11 CecileRobertMichon

/priority important-longterm

CecileRobertMichon avatar Nov 03 '20 23:11 CecileRobertMichon

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Feb 02 '21 00:02 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot avatar Mar 04 '21 01:03 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

fejta-bot avatar Apr 03 '21 02:04 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 03 '21 02:04 k8s-ci-robot

/lifecycle frozen

status update:

  1. "Cluster ResourceSet needs to be implemented to properly support this": done
  2. "CAPZ clusters should be deployable using OOT Azure Provider and Storage Drivers": done
  3. "Default should be OOT": hold until OOT is fully ready
  4. "Tests need to be updated to test both modes and migration scenario from in-tree to OOT": added tests for OOT (already testing in-tree); not testing migration currently
  5. "Implement Cluster ResourceSet": done in #1216
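For anyone finding this later, a minimal sketch of the kind of ClusterResourceSet used to install the OOT components (names, labels, and the ConfigMap are illustrative; see #1216 for the actual implementation):

```yaml
apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: cloud-provider-azure              # illustrative name
  namespace: default
spec:
  strategy: ApplyOnce                     # apply the resources once per matching cluster
  clusterSelector:
    matchLabels:
      cloud-provider: azure-external      # clusters opt in via this (illustrative) label
  resources:
    - name: cloud-provider-azure-manifests   # ConfigMap holding the CCM/CSI manifests
      kind: ConfigMap
```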

CecileRobertMichon avatar Apr 05 '21 16:04 CecileRobertMichon

Default should be OOT

Now that v1.0.0 has been released, we should be able to move forward with this

CecileRobertMichon avatar May 17 '21 18:05 CecileRobertMichon

/assign

CecileRobertMichon avatar May 17 '21 18:05 CecileRobertMichon

/unassign /assign @jackfrancis

CecileRobertMichon avatar Mar 29 '22 21:03 CecileRobertMichon

cc @sonasingh46

shysank avatar Mar 29 '22 22:03 shysank

I have been trying to validate this manually, especially around the Kubernetes 1.22 --> 1.23 upgrade paths. The following in-tree components for Azure are the points of attention:

  • AzureDisk CSI driver
  • AzureFile CSI driver
  • cloud-provider-azure

As part of the effort to extract cloud provider dependencies from Kubernetes, cloud-provider-specific code is moving out of the core Kubernetes codebase; as a result, the in-tree CSI volume plugins and cloud providers are being replaced by their external counterparts.

From Kubernetes 1.23, AzureDisk CSI migration is enabled by default. This means that provisioning a disk volume requires installing the external AzureDisk CSI driver, because the in-tree AzureDisk plugin no longer serves these requests once migration is enabled.
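Concretely, once the external driver is installed, new disk volumes are provisioned through a StorageClass that points at the CSI driver instead of the in-tree `kubernetes.io/azure-disk` provisioner. A sketch (name and parameters are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi                   # illustrative name
provisioner: disk.csi.azure.com       # external AzureDisk CSI driver
parameters:
  skuName: StandardSSD_LRS            # illustrative disk SKU
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```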

The in-tree AzureFile plugin will continue to work in 1.23, as AzureFile CSI migration is not enabled by default there. If a user or admin enables AzureFile CSI migration, the external AzureFile CSI driver must be installed.

Consider the following upgrade paths from v1.22 to v1.23:

Scenario 1: Upgrade a cluster from Kubernetes 1.22 to 1.23 without any extra tuning or configuration

  • AzureDisk CSI migration is enabled by default on the upgraded cluster.
  • The external AzureDisk CSI driver must be installed so that pods using existing volumes from the 1.22 in-tree AzureDisk driver continue to work on the upgraded cluster.
  • Creating new volumes also requires the external AzureDisk CSI driver. One way to install it is via a ClusterResourceSet (CRS).
  • AzureFile CSI migration is disabled by default.
  • Existing volumes created by the 1.22 in-tree AzureFile driver will continue to work on the upgraded cluster.
  • New Azure file volumes can be created without installing any external driver.
  • The in-tree CCM is enabled by default.

Scenario 2: Upgrade a cluster from Kubernetes 1.22 to 1.23 with AzureDisk CSI migration disabled (see the configuration sketch after this list)

  • AzureDisk CSI migration will be disabled on the upgraded cluster.
  • Existing volumes created by the 1.22 in-tree AzureDisk driver will continue to work on the upgraded cluster via the in-tree driver.
  • New Azure disk volumes can be created by the in-tree driver without installing any external driver.
  • AzureFile CSI migration is disabled by default.
  • Existing volumes created by the 1.22 in-tree AzureFile driver will continue to work on the upgraded cluster via the in-tree driver.
  • New Azure file volumes can be created without installing any external driver.
  • The in-tree CCM is enabled by default.
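A sketch of what disabling the migration could look like in a KubeadmControlPlane spec, assuming the `CSIMigrationAzureDisk` feature gate (field placement is illustrative, not a tested configuration):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          feature-gates: CSIMigrationAzureDisk=false   # keep in-tree disk provisioning
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: CSIMigrationAzureDisk=false   # kubelet gates must match
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: CSIMigrationAzureDisk=false
```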

PS: Still validating other scenarios

sonasingh46 avatar Mar 30 '22 08:03 sonasingh46

Scenario 3: Upgrade a cluster from Kubernetes 1.22 to 1.23 with the external cloud provider enabled

  • The upgrade failed: the new control plane machine did not pass the preflight checks. The readiness and startup probes failed for the control plane components on the new control plane machine that came up.

  • To fix this, we may need to enable the external volume plugin. (WIP)
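For reference, the usual mechanism for keeping the in-tree volume plugins working while running the external cloud provider is kube-controller-manager's `--external-cloud-volume-plugin` flag. A sketch of the relevant extraArgs (untested here, placement illustrative):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          cloud-provider: external
          external-cloud-volume-plugin: azure   # keep in-tree azure volume code active
```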

sonasingh46 avatar Mar 30 '22 09:03 sonasingh46

@jackfrancis and @Jont828, is this something that should land in milestone v1.5, or will it probably hit the next one?

mboersma avatar Jul 21 '22 15:07 mboersma

I'm not too sure; is there a PR open or being worked on for this ATM? Looks like Jack was assigned to it, so maybe we can ask him when he's back.

Jont828 avatar Jul 21 '22 22:07 Jont828

I think we can land this in the next milestone

jackfrancis avatar Jul 22 '22 14:07 jackfrancis

/milestone next

mboersma avatar Aug 18 '22 15:08 mboersma

/assign /milestone v1.8

CecileRobertMichon avatar Jan 10 '23 22:01 CecileRobertMichon