
EKS VPC CNI cannot be disabled because AWS now installs via Helm

Open AndiDog opened this issue 1 year ago • 21 comments

/kind bug

What steps did you take and what happened:

Creating an EKS cluster with a custom CNI (such as Cilium) is currently problematic because CAPA does not correctly remove the automatically preinstalled AWS VPC CNI. This shows up as aws-node pods still running on the cluster.
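For example, the leftover pods can be listed via the k8s-app label that the DaemonSet below carries:

$ kubectl get pods -n kube-system -l k8s-app=aws-node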

CAPA has code (linked below) to delete the AWS VPC CNI resources if AWSManagedControlPlane.spec.vpcCni.disable=true, but it only deletes them if they are not managed by Helm. I presume that is deliberate, so that CAPA users can deploy the VPC CNI with Helm in their own way.
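For context, disabling the VPC CNI is requested like this (minimal snippet; the cluster name is made up, and I'm assuming the v1beta2 API of CAPA v2.x):

apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: my-cluster-control-plane
spec:
  vpcCni:
    disable: true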

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/3618d1c1567afe72781059cd9a3498e7ae44b3b5/pkg/cloud/services/awsnode/cni.go#L269-L293
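The skip condition boils down to a Helm ownership label check along these lines (a simplified sketch on my part; names are illustrative, the exact logic is in the permalink above):

package awsnode

// isHelmManaged is a simplified sketch of the check CAPA performs before
// deleting VPC CNI resources: anything carrying the standard Helm
// ownership label is assumed to be a user-managed install and is skipped.
func isHelmManaged(labels map[string]string) bool {
	return labels["app.kubernetes.io/managed-by"] == "Helm"
}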

Unfortunately, it seems that AWS introduced a breaking change by switching their own automagic deployment method to Helm, including the relevant labels, so CAPA's check now mistakes the default install for a user-managed one. This is what the aws-node DaemonSet looks like on a newly-created EKS cluster (VPC CNI not disabled, cluster created by CAPA ~v2.3.0):

$ kubectl get ds -n kube-system aws-node -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2024-01-18T15:40:42Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.15.1
    helm.sh/chart: aws-vpc-cni-1.15.1
    k8s-app: aws-node
  name: aws-node
  namespace: kube-system
  # [...]
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: aws-vpc-cni
        app.kubernetes.io/name: aws-node
        k8s-app: aws-node

The deletion code must be fixed. Sadly, AWS does not add any extra label that would mark the deployment as AWS-managed, so there is no obvious way to tell it apart from a user's own Helm install. And this breaking change even applies to older Kubernetes versions like 1.24.
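One possible direction for a fix (an assumption on my side, not a confirmed solution): a genuine Helm 3 release also stamps the meta.helm.sh/release-name annotation on resources it owns, while the DaemonSet above only carries the labels. A check along these lines might tell the two apart:

package awsnode

// isUserHelmRelease sketches a possible replacement check: Helm 3 sets
// both the managed-by label and the release annotations on resources it
// owns, whereas the AWS default install (per the DaemonSet above) only
// copies the labels. This is a heuristic, not verified against all EKS
// versions.
func isUserHelmRelease(labels, annotations map[string]string) bool {
	return labels["app.kubernetes.io/managed-by"] == "Helm" &&
		annotations["meta.helm.sh/release-name"] != ""
}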

Related: an E2E test to cover this feature is wanted (issue)

Environment:

  • Cluster-api-provider-aws version: ~v2.3.0 (fork with some backports)
  • Kubernetes version (from kubectl version): 1.24

AndiDog · Jan 18 '24 15:01