
[EKS] EBS CSI Driver Incompatibility Between EKS AutoMode and Managed Node Groups

Open lokeshmdevops opened this issue 10 months ago • 13 comments

Issue Overview:

In an EKS cluster with both Auto Mode nodes and Managed Node Groups, we observed a conflict in how nodes register with the EBS CSI driver:

• EKS Auto Mode nodes automatically register with ebs.csi.eks.amazonaws.com.
• Managed Node Group nodes automatically register with ebs.csi.aws.com.
• This mismatch causes PVC provisioning and attachment failures when workloads run across both node types.

Findings:

• EKS Auto Mode nodes are tightly coupled with ebs.csi.eks.amazonaws.com; they do not register with ebs.csi.aws.com.
• Managed Node Groups are tightly coupled with ebs.csi.aws.com; they do not switch to ebs.csi.eks.amazonaws.com.
• Mixing Auto Mode and Managed nodes in the same cluster leads to PVC attachment failures.

Impact:

• Workloads that rely on EBS volumes fail when scheduled across both node types.
• Manual intervention is needed to ensure workloads run on the correct nodes, adding operational overhead.

Expected Behavior:

• Nodes in an EKS cluster should register consistently with the same CSI driver.
• Either both node types should support a unified EBS CSI driver, or there should be a way to configure the CSI driver per cluster.

Requesting Labels: EKS, Amazon Elastic Kubernetes Service, EKS Auto Mode, EKS Networking

lokeshmdevops avatar Feb 24 '25 09:02 lokeshmdevops

Do you have the EKS add-on or Helm chart for the EBS CSI driver installed in the cluster? You need that to manage volumes attached to nodes running in a managed node group or any other non-Auto Mode compute in the cluster.

mikestef9 avatar Feb 24 '25 16:02 mikestef9

Hi

I'm experiencing the same issue. I do have Amazon EBS CSI Driver v1.39.0 installed as add-on.

Getting the following error when trying to schedule workloads on an EKS Auto Mode node with a storage class configured with provisioner: ebs.csi.aws.com:

"Unhandled Error" err="error syncing claim \"<redacted>\": failed to provision volume with StorageClass \"gp3-default\": error generating accessibility requirements: no topology key found on CSINode <redacted>" logger="UnhandledError"

This could be the reason: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/fd95d0a4b4774e4b4927c37bd504a1dd3be54162/deploy/kubernetes/base/node.yaml#L25-L34

thank you

andrey-ch-dev avatar Feb 25 '25 23:02 andrey-ch-dev

That's the wrong storage class for provisioning volumes for workloads running on Auto Mode. You need to use

provisioner: ebs.csi.eks.amazonaws.com
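For reference, a minimal StorageClass targeting the Auto Mode block storage provisioner (ebs.csi.eks.amazonaws.com, the name used in the issue title and later comments) might look like the sketch below; the class name and parameters are illustrative, not from this thread:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: auto-ebs-sc                      # illustrative name
provisioner: ebs.csi.eks.amazonaws.com   # EKS Auto Mode block storage provisioner
volumeBindingMode: WaitForFirstConsumer  # delay binding until a pod is scheduled
parameters:
  type: gp3
  encrypted: "true"
```

Workloads on Auto Mode nodes would then request volumes through this class, while a class with provisioner ebs.csi.aws.com remains for non-Auto Mode nodes.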

mikestef9 avatar Feb 26 '25 17:02 mikestef9

qq, how do we use existing volumes with provisioner: ebs.csi.aws.com that were created before EKS Auto Mode was enabled in the cluster?

Currently, the cluster has self-managed node groups and Auto Mode node pools, but we cannot really move workloads that are still tied to a storage class configured with provisioner: ebs.csi.aws.com.

thank you for the response.

andrey-ch-dev avatar Feb 26 '25 18:02 andrey-ch-dev

Do you have the EKS add-on or Helm chart for the EBS CSI driver installed in the cluster? You need that to manage volumes attached to nodes running in a managed node group or any other non-Auto Mode compute in the cluster.

Yes, we have the EKS add-on installed in the cluster, and it is running fine.

As mentioned in the issue, some dependent microservices (e.g., StatefulSets) need to run on managed node groups, while all other stateless microservices will run on EKS Auto Mode.

We need to use both managed node pools and EKS Auto Mode-based nodes. However, the issue arises because:

• Managed node groups use the provisioner ebs.csi.aws.com.
• Auto Mode nodes register with the provisioner ebs.csi.eks.amazonaws.com.

This mismatch is causing the problem.

Thanks for your response! Please let me know if you need any more information.

lokeshmdevops avatar Mar 03 '25 06:03 lokeshmdevops

If the services are running on separate compute (Auto Mode and non Auto Mode), this should be a supported configuration. Are you able to open an AWS support case?

mikestef9 avatar Mar 03 '25 16:03 mikestef9

If the services are running on separate compute (Auto Mode and non Auto Mode), this should be a supported configuration. Are you able to open an AWS support case?

I haven't opened an AWS support case yet, but I can do that if needed. Before that, I wanted to check if there are any known issues or recommended configurations to make EBS CSI work seamlessly across Auto Mode and non-Auto Mode nodes. Any guidance on this would be helpful.

lokesh-mateti avatar Mar 24 '25 06:03 lokesh-mateti

I also ran into this

  1. Make an EKS cluster with the EBS CSI driver add-on installed and EKS Auto Mode disabled.
  2. Make a StatefulSet, which provisions a PVC.
  3. Switch the cluster to EKS Auto Mode.
  4. Delete the old self-managed node pools (with the expectation that karpenter.sh will provide nodes).
  5. Most pods will schedule correctly, but stateful pods that mount a PVC will be stuck in Pending.
  6. Re-add a self-managed node pool alongside Auto Mode Karpenter to get the stateful pods to schedule.

It'd be nice if someone from AWS could confirm it's a bug, and if it'd get on a roadmap to be fixed.

neoakris avatar Mar 25 '25 15:03 neoakris

This is expected. The OSS EBS CSI driver and the managed version that ships with Auto Mode have different provisioner names: ebs.csi.aws.com vs ebs.csi.eks.amazonaws.com.

To migrate you need to do something like

  1. Change the old PV's reclaim policy to Retain, so the underlying EBS volume survives the next step.
  2. Delete the old PV object.
  3. Statically provision the same EBS volume by creating a new PV that the workload can point to, pre-bound to a PVC.
  4. Associate a new PVC with the statically provisioned PV by creating the PVC with its volumeName set to that PV.
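The static provisioning steps above can be sketched roughly as follows. This is a hypothetical example, not an official procedure: all names, sizes, and the volume ID are placeholders, and the EKS Auto Mode docs should be consulted for any additional required fields (such as node affinity constraints):

```yaml
# Hypothetical sketch: re-adopt an existing EBS volume under the Auto Mode provisioner.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: migrated-data-pv               # illustrative name
spec:
  capacity:
    storage: 20Gi                      # must match the existing volume's size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: auto-ebs-sc        # a class using the Auto Mode provisioner
  csi:
    driver: ebs.csi.eks.amazonaws.com
    volumeHandle: vol-0123456789abcdef0   # the pre-existing EBS volume ID
    fsType: ext4
  claimRef:                            # pre-bind this PV to the PVC below
    namespace: default
    name: migrated-data-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: migrated-data-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: auto-ebs-sc
  volumeName: migrated-data-pv         # bind explicitly to the static PV
  resources:
    requests:
      storage: 20Gi
```

The claimRef on the PV and the volumeName on the PVC pre-bind the pair to each other, so no other claim can grab the volume while the workload is being repointed.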

We recognize this is less than ideal, and have some better documentation in flight about volume migration, as well as some product improvements to make migration more seamless.

mikestef9 avatar Mar 27 '25 03:03 mikestef9

This is expected. The OSS EBS CSI driver and the managed version that ships with Auto Mode have different provisioner names: ebs.csi.aws.com vs ebs.csi.eks.amazonaws.com.

This does appear to be at odds with the documentation at https://docs.aws.amazon.com/eks/latest/userguide/migrate-auto.html, which states "You can install the EBS CSI controller on an Amazon EKS Auto Mode cluster. Use a StorageClass to associate volumes with either the EBS CSI Controller or EKS Auto Mode." To me, that suggests volumes provisioned using the ebs.csi.aws.com provisioner would continue to function after a switch to EKS Auto Mode, as long as the EBS CSI Driver add-on remains enabled.

PeteLawrence avatar Mar 27 '25 09:03 PeteLawrence

Working on docs updates https://github.com/awsdocs/amazon-eks-user-guide/pull/946/files

And again, working on better long term guidance for how to migrate volumes to Auto Mode.

mikestef9 avatar Mar 28 '25 10:03 mikestef9

After doing a deep dive into the EKS docs, I now agree that this is intended/documented behavior.

Extra Helpful EKS Docs: (AWS support helped point these out to me, that I missed from googling)

  • https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html#ebs-csi-considerations
    • The PVC incompatibility is documented on this page
      "Only platform versions created from a storage class using ebs.csi.eks.amazonaws.com as the provisioner can be mounted on nodes created by EKS Auto Mode. Existing platform versions must be migrated to the new storage class using a volume snapshot."
    • I pressed the feedback button on the page to suggest the following bit of improved wording: "Only PVCs created from a storage class using ebs.csi.eks.amazonaws.com as the provisioner can be mounted on nodes created by EKS Auto Mode. Pre-existing PVCs created by the EBS CSI Driver (managed by EKS Add-on), can only be mounted by worker nodes that are not managed by EKS Auto Mode. If you only have EKS Auto Mode nodes, then pods trying to mount Pre-existing PVCs will be stuck in pending status until you can migrate them to a new storage class that references the new EBS Volume Provisioner used by EKS Auto Mode. Kubernetes PVC volume snapshot can be used to migrate PVCs to a new storage class."
  • https://docs.aws.amazon.com/eks/latest/userguide/migrate-auto.html
    "AWS does not support the following migrations":
    • "Migrating volumes from the EBS CSI Controller to EKS Auto Mode Block Storage"
    • I also used feedback form to suggest the following rewording on that page
      "Migrating volumes from the EBS CSI Controller (managed by EKS Add-on) to EBS CSI Controller (managed by EKS Auto Mode), PVCs made with one can't be mounted by the other, because they use 2 different kubernetes volume provisioners."
  • An important side note to anyone migrating, read this carefully especially the migration reference table https://docs.aws.amazon.com/eks/latest/userguide/migrate-auto.html
    It looks to me like the 3 main things that are problematic are:
    1. PVCs may need to be migrated to a new storage class using kubernetes volume snapshots + manual effort.
      (The EBS CSI Driver add-on has a node affinity rule that prevents it from running on Auto Mode provisioned nodes https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.41.0/charts/aws-ebs-csi-driver/values.yaml#L200-L205; I'm guessing this is to avoid conflicts with the CSI volume provider baked into Auto Mode's own AMIs.)
    2. Blue-Green cutover of services of type load balancer, since Auto Mode uses a different loadBalancerClass
    3. Blue-Green cutover of ingress.yaml objects (if AWS LB Controller is used), since Auto Mode uses a different Kubernetes IngressClass than the AWS LB Controller Add-On.
  • The EKS section of the AWS Web Console makes Auto Mode look like a simple toggle, but given how involved a migration to Auto Mode is and how many manual operations it requires, I'd personally recommend doing a blue-green cutover to a new cluster when turning Auto Mode on or off.
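The snapshot-based PVC migration described in the docs quoted above can be sketched roughly like this. It is a hypothetical example, assuming the CSI snapshot controller is installed and a VolumeSnapshotClass exists; all names are illustrative:

```yaml
# Hypothetical sketch: snapshot a PVC created by ebs.csi.aws.com,
# then restore it into a StorageClass that uses the Auto Mode provisioner.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
  namespace: default
spec:
  volumeSnapshotClassName: ebs-vsc       # illustrative VolumeSnapshotClass
  source:
    persistentVolumeClaimName: old-data  # PVC provisioned by ebs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: new-data
  namespace: default
spec:
  storageClassName: auto-ebs-sc          # class using ebs.csi.eks.amazonaws.com
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi                      # at least the size of the source volume
```

The workload is then updated to mount new-data instead of old-data; the old PVC and PV can be cleaned up once the restored volume is verified.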

BTW if anyone's in contact with the AWS team, a feature request that'd be very nice is:

  • In the AWS Web Console section for EKS Auto Mode (the spot where you can toggle it on and off), it'd be great to link to the Auto Mode migration reference docs, so users are less reliant on web search (which failed to lead me to the docs linked above).

neoakris avatar Mar 28 '25 18:03 neoakris

If anyone comes here after an Auto Mode transition, use this: https://github.com/awslabs/eks-auto-mode-ebs-migration-tool

Ready to use script:

NAMESPACE='X'        # namespace containing the PVCs to migrate
SOURCE_CLASS='Y'     # old storage class (provisioner ebs.csi.aws.com)
TARGET_CLASS='Z'     # new storage class (provisioner ebs.csi.eks.amazonaws.com)
CLUSTER_NAME='C'
DRYRUN='false'

for pvc_name in $(kubectl get pvc --no-headers --namespace "$NAMESPACE" | grep " $SOURCE_CLASS " | grep ' Bound ' | cut -d ' ' -f 1); do
    # Back up each PVC manifest before migrating it
    kubectl get --namespace "$NAMESPACE" pvc "$pvc_name" -o yaml > "backup_pvc_${pvc_name}.yaml"
    echo "Created backup for PVC: $pvc_name"

    eks-auto-mode-ebs-migration-tool -cluster-name "$CLUSTER_NAME" -pvc-name "$pvc_name" -storageclass "$TARGET_CLASS" -namespace "$NAMESPACE" --dry-run="$DRYRUN" -yes
done

Fran-Rg avatar Oct 30 '25 10:10 Fran-Rg