
Cannot run kops update cluster on AWS with no spec.cloudControllerManager set when AWSEBSCSIDriver is not managed by kops

Open flopib opened this issue 2 years ago • 5 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.24.5

Update: issue also happens with 1.26.3

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running; otherwise, provide the Kubernetes version specified as a kops flag.

1.24.14

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

  • Make sure spec.cloudControllerManager is not set in the Cluster manifest
  • Set spec.kubernetesVersion to 1.24.x
  • Set spec.cloudConfig.awsEBSCSIDriver.enabled: false
  • kops replace -f manifest.yaml
  • kops update cluster k8s.local --target=terraform
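
A trimmed-down sketch of the manifest fields involved (cluster name and version taken from the steps above; the rest of the spec is omitted for brevity):

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: k8s.local
spec:
  kubernetesVersion: 1.24.14
  cloudConfig:
    awsEBSCSIDriver:
      enabled: false
  # note: spec.cloudControllerManager is deliberately not set anywhere in the spec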

5. What happened after the commands executed?

Error: completed cluster failed validation: spec.externalCloudControllerManager: Forbidden: AWS external CCM cannot be used without enabling spec.cloudConfig.AWSEBSCSIDriver.

6. What did you expect to happen?

The command to execute successfully, since I am not setting spec.cloudControllerManager at all. awsEBSCSIDriver.enabled is set to false because I want to install and manage it outside of kops.

7. Please provide your cluster manifest.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2022-03-01T19:30:28Z"
  generation: 1
  name: <redacted>
spec:
  additionalPolicies:
    master: |
      <redacted>
    node: |
      <redacted>
  api:
    loadBalancer:
      class: Classic
      idleTimeoutSeconds: 1800
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    awsEBSCSIDriver:
      enabled: false
    manageStorageClasses: false
  cloudProvider: aws
  configBase: <redacted>
  containerRuntime: containerd
  containerd:
    configOverride: |
      version = 2
      [plugins]
        [plugins."io.containerd.grpc.v1.cri"]
          [plugins."io.containerd.grpc.v1.cri".containerd]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  SystemdCgroup = true
  dnsZone: <redacted>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-eu-west-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-eu-west-1c
      name: c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8081
      - name: ETCD_METRICS
        value: extended
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 30d
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-eu-west-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-eu-west-1c
      name: c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8082
      - name: ETCD_METRICS
        value: extended
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 7d
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
    - aws:
        policyARNs:
        - <redacted>
      name: cert-manager
      namespace: cert-manager
    - aws:
        policyARNs:
        - <redacted>
      name: cluster-autoscaler
      namespace: kube-system
    - aws:
        policyARNs:
        - <redacted>
      name: external-dns
      namespace: infra
    useServiceAccountExternalPermissions: true
  kubeAPIServer:
    auditLogMaxAge: 5
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/kube-apiserver/audit.conf
    defaultNotReadyTolerationSeconds: 150
    defaultUnreachableTolerationSeconds: 150
    disableBasicAuth: true
    eventTTL: 6h0m0s
    logFormat: json
  kubeControllerManager:
    featureGates:
      CSIMigrationAWS: "true"
    horizontalPodAutoscalerDownscaleDelay: 3m0s
    horizontalPodAutoscalerSyncPeriod: 15s
    horizontalPodAutoscalerUpscaleDelay: 3m0s
    logFormat: json
  kubeDNS:
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubeScheduler:
    logFormat: json
    usePolicyConfigMap: true
  kubelet:
    anonymousAuth: false
    cgroupDriver: systemd
    featureGates:
      CSIMigrationAWS: "true"
    logFormat: json
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.24.14
  masterInternalName: <redacted>
  masterPublicName: <redacted>
  networkCIDR: 10.252.0.0/17
  networking:
    canal: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  ntp:
    managed: false
  serviceAccountIssuerDiscovery:
    discoveryStore: <redacted>
    enableAWSOIDCProvider: true
  sshAccess:
  - <redacted>
  subnets:
  - cidr: 10.252.16.0/20
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.252.32.0/20
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.252.48.0/20
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.252.0.0/23
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.252.2.0/23
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.252.4.0/23
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    bastion:
      bastionPublicName: <redacted>
      idleTimeoutSeconds: 1800
    dns:
      type: Public
    masters: private
    nodes: private

flopib commented Jun 01 '23 14:06

This was a design decision, for sure not a bug. @olemarkus do you remember why we chose this behaviour?

hakman commented Jun 02 '23 08:06

EBS CSI driver didn't support manual install in 1.24. It was added in 1.25 though: https://kops.sigs.k8s.io/addons/#self-managed-aws-ebs-csi-driver
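
For reference, the self-managed setup described on the linked page is driven by a separate managed field rather than enabled (a minimal sketch, assuming kOps 1.25+ and the v1alpha2 field names shown in this manifest; check the linked docs for the exact spelling in your version):

spec:
  cloudConfig:
    awsEBSCSIDriver:
      managed: false  # kOps does not install the driver; you install and manage it outside of kOps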

olemarkus commented Jun 06 '23 19:06

Right, the addition of support for a self-managed EBS CSI driver does indeed solve the issue, thanks!

Of course, I still get the same error if awsEBSCSIDriver.enabled is left set to false, although that configuration makes less sense in my case now that there is a separate managed parameter.

The error message also misled me into thinking that the external CCM was an opt-in setup controlled by the presence (or absence) of spec.cloudControllerManager in the cluster manifest.

flopib commented Jun 07 '23 20:06

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Jan 22 '24 02:01

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Feb 21 '24 02:02

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot commented Mar 22 '24 03:03

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot commented Mar 22 '24 03:03