
[EKS] [announcement]: Temporary rollback of enforcing upgrade insights on update cluster version

Open mikestef9 opened this issue 9 months ago • 11 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Which service(s) is this announcement for? EKS

What are you announcing? Yesterday we launched a new feature in EKS where we now check and enforce the status of EKS upgrade insights as part of the UpdateClusterVersion API.

This feature is intended to prevent customers from accidentally upgrading when EKS has identified issues that have a strong chance of causing impact on upgrade. If you are encountering this new behavior when upgrading, the best course of action is to check EKS upgrade insights for the findings marked as ERROR. EKS upgrade insights refreshes on a 24-hour basis, and successful remediation will result in PASSING status after the next refresh (except in the case of deprecated API checks, see #2569). Then you can upgrade without needing the force flag.
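As a rough sketch of the check described above, the AWS CLI's list-insights command can be filtered down to upgrade-readiness findings in ERROR status. The helper name list_upgrade_errors and the cluster name are placeholders for illustration, not anything defined by EKS.

```shell
# Hypothetical helper: list upgrade-readiness insights currently in ERROR.
# $1 = cluster name. Prints one line per insight: id, name, status.
list_upgrade_errors() {
  aws eks list-insights \
    --cluster-name "$1" \
    --filter '{"categories":["UPGRADE_READINESS"],"statuses":["ERROR"]}' \
    --query 'insights[].[id,name,insightStatus.status]' \
    --output text
}
```

Calling list_upgrade_errors my-cluster should return no findings once everything has been remediated and insights have refreshed. For a single finding, aws eks describe-insight --cluster-name my-cluster --id <insight-id> shows the details.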

While we launched with support in AWS-owned tools like the AWS CLI and CloudFormation, we recognize many EKS customers use 3rd party management tools (such as Terraform) which have not yet been updated, and these users cannot easily pass the force flag.

Given this, we have decided to temporarily roll back this feature (the --force flag will still exist, but be treated as a no-op) and give time for community tools to catch up by releasing support for the new force flag. We are aiming to have this change completed by end of day. We will update this issue again once the rollback is completed, and we will provide further updates on when we will roll the feature forward again once we feel enough 3rd party tools have been updated with support for the force flag.
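For reference, a sketch of how the force flag is passed with the AWS CLI. The wrapper function upgrade_cluster and its "force" argument are hypothetical conveniences; only update-cluster-version and --force come from the CLI itself. While the rollback is in effect, the flag is accepted but treated as a no-op.

```shell
# Hypothetical wrapper around the cluster version update.
# $1 = cluster name, $2 = target Kubernetes version,
# $3 = optional literal "force" to bypass upgrade insights enforcement.
upgrade_cluster() {
  if [ "${3:-}" = "force" ]; then
    aws eks update-cluster-version --name "$1" --kubernetes-version "$2" --force
  else
    aws eks update-cluster-version --name "$1" --kubernetes-version "$2"
  fi
}
```

Keeping the forced path behind an explicit argument makes it harder to skip the insight checks by accident.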

mikestef9 avatar Mar 28 '25 17:03 mikestef9

Confirming that the rollback completed late on Friday. Will provide updates in the future on rolling forward once enough ecosystem tools have been updated with support for the force flag.

mikestef9 avatar Mar 31 '25 09:03 mikestef9

@mikestef9, thanks for the detailed update and for rolling back the enforcement to give third-party tools time to catch up — really appreciate the responsiveness.

One improvement that could help reduce friction during upgrades is shortening the refresh interval for EKS Upgrade Insights as well. The current 24-hour cycle can slow down remediation and delay critical upgrade workflows, especially in production environments. A shorter interval (maybe 1h) or even a manual refresh option would make it much easier to validate fixes in a timely manner.

Also, a quick question: Why do the insights fail due to the requirement to update EKS-managed add-ons to the latest version supported by the current cluster version before upgrading the control plane to the next version? This was the issue we encountered during the upgrade — we had pinned versions for the add-ons (kube-proxy, CoreDNS, and VPC CNI), and these were reported as errors.

bmihaescu avatar Apr 02 '25 14:04 bmihaescu

First point, we are strongly considering adding a refresh API before rolling this forward. There will be some kind of rate limiting to prevent abuse, but we understand the suboptimal UX of waiting 24 hours. Although that doesn't help with deprecated API usage insight checks, see my comment here.

On 2nd point, Upgrade Insights use the compatibilities section of the DescribeAddonVersions metadata API to determine addon compatibility with Kubernetes minor versions. If a currently installed addon version is not compatible with a future Kubernetes version, then this is reported as an ERROR.
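To see what that compatibility data looks like for a pinned add-on, something like the following could be used. This is a sketch only: the function names are made up for illustration, and it queries describe-addon-versions for the target Kubernetes version rather than reproducing whatever logic Upgrade Insights runs internally.

```shell
# Hypothetical helpers; names are illustrative only.

# Print the add-on versions EKS lists as compatible with a Kubernetes version,
# one per line.
# $1 = add-on name (e.g. coredns), $2 = Kubernetes minor version (e.g. 1.32)
compatible_addon_versions() {
  aws eks describe-addon-versions \
    --addon-name "$1" \
    --kubernetes-version "$2" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text | tr '\t' '\n'
}

# Exit 0 if the pinned version appears in that list, non-zero otherwise.
# $1 = add-on name, $2 = pinned add-on version, $3 = target Kubernetes version
is_addon_version_compatible() {
  compatible_addon_versions "$1" "$3" | grep -Fxq -e "$2"
}
```

For example, is_addon_version_compatible coredns "$PINNED_VERSION" 1.32 && echo ok would confirm a pinned CoreDNS version against the next minor version before attempting the upgrade.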

mikestef9 avatar Apr 02 '25 15:04 mikestef9

Thanks for the clarification!

Please strongly add that refresh API — let’s leave “considering” behind 😉.

Regarding the add-on version check: from what you're saying, it seems that to avoid upgrade blockers, I need to stop pinning add-on versions and instead always use the default/latest version supported by the current EKS version, as defined by the metadata API. That’s understandable, but it introduces a side effect: my production environments could end up running different minor versions of add-ons across clusters, depending on when they were last updated. That’s not necessarily a major issue, but it’s definitely not ideal from a consistency or maintenance perspective.

Also, it’s still not entirely clear what kind of differences exist between minor versions of these add-ons (e.g., kube-proxy, CoreDNS, VPC CNI) that justify making them strictly enforced as part of the upgrade validation. Would appreciate a bit more context there if possible.

bmihaescu avatar Apr 02 '25 17:04 bmihaescu

Not sure I totally follow that. There are many cases in the DescribeAddonVersions API compatibility matrix where older (not latest/default) versions are still compatible with newer Kubernetes versions. You can and should pin addon versions. We don't even support auto updates of addons (yet).

mikestef9 avatar Apr 02 '25 20:04 mikestef9

First point, we are strongly considering adding a refresh API before rolling this forward. There will be some kind of rate limiting to prevent abuse, but we understand the suboptimal UX of waiting 24 hours.

Yes, my team would also benefit from being able to trigger a refresh. Not having it slows us down - the 24h+ feedback cycle to changes we make is impactful.

dmacbride-ep avatar Apr 02 '25 20:04 dmacbride-ep

Upgrade Insights use the compatibilities section of the DescribeAddonVersions metadata API to determine addon compatibility with Kubernetes minor versions

it’s still not entirely clear what kind of differences exist between minor versions of these add-ons (e.g., kube-proxy, CoreDNS, VPC CNI) that justify making them strictly enforced as part of the upgrade validation. Would appreciate a bit more context there if possible.

As an example, our team was surprised that CoreDNS version 1.11.3-eksbuild.1 seemed insufficient, and that we needed to update to 1.11.4-eksbuild.2 to pass the Upgrade Insights check. I recall that at one time Amazon documented 1.11.3-eksbuild.1 as being either suitable or recommended for EKS 1.31, and that's what we implemented when we updated to 1.31. I don't believe that CoreDNS 1.11.3 is actually incompatible with Kubernetes 1.32 even though it's not the version that Amazon documents as the latest.

Maybe the issue is with the behaviour of the DescribeAddonVersions API?

dmacbride-ep avatar Apr 02 '25 21:04 dmacbride-ep

Thanks for the rollback guys! <3

We were caught a little off-guard by this. In our case, we were blocked by kube-proxy. We don't even use it; we run Cilium with kubeProxyReplacement. Yet we have the kube-proxy daemonset (with a nodeSelector, so 0 replicas), as it comes with a fresh EKS cluster (we have 0 addons), and we've been unsure whether there is some magic in the EKS background that works with it. Given that you check the versions and we were able to fix the check by bumping the daemonset's image version, we found some proof of such background activity. We now have to decide whether to start maintaining an app that we don't use, or wait for the Terraform provider and use the force flag. But the force flag skips all blockers, and I think there are good reasons not to ignore all of them blindly.

From our perspective, it would be great:

  • to have more control over which Upgrade Insights checks are respected or skipped (force flag)
  • to have more detailed docs per check: what exactly the requirement is and how to mitigate it (e.g. is it ok to remove the daemonset)

mikel-jason avatar Apr 03 '25 06:04 mikel-jason

You should just remove the kube-proxy addon from your cluster. These addons are installed unmanaged by default for historical and backwards compatibility reasons. But we also launched a feature last year to start a cluster without any (unmanaged) addons installed: https://aws.amazon.com/about-aws/whats-new/2024/06/amazon-eks-cluster-creation-flexibility-networking-add-ons/

mikestef9 avatar Apr 03 '25 10:04 mikestef9

FYI

  • The Terraform AWS provider as of version v5.95.0 supports the force_update_version parameter
  • Version v20.36.0 of the terraform-aws-eks module also supports the force_update_version parameter

bryantbiggs avatar Apr 18 '25 16:04 bryantbiggs

You should just remove the kube-proxy addon from your cluster.

In case someone still stumbles across this, here's how to remove unmanaged EKS addons: https://repost.aws/questions/QUNDJ0XuXpRgKHvg7J6JMFxA/how-to-properly-uninstall-eks-unmanaged-add-ons#AN-RJvm0ZnQf6R3dUDDfio7g

$ aws eks create-addon --cluster-name "$CLUSTER_NAME" --addon-name "$ADDON_NAME" --resolve-conflicts NONE
...
$ aws eks delete-addon --cluster-name "$CLUSTER_NAME" --addon-name "$ADDON_NAME" --no-preserve
...

mikel-jason avatar May 30 '25 06:05 mikel-jason

EKS has rolled out a new feature that allows you to refresh cluster insights on demand. https://aws.amazon.com/about-aws/whats-new/2025/08/amazon-eks-on-demand-insights-refresh/

With this feature you no longer need to wait 24 hours for insights to refresh; you can trigger a refresh manually with the new API. Refer to the user guide on how to use the API.
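A minimal sketch of what triggering a refresh might look like from the AWS CLI. The command names start-insights-refresh and describe-insights-refresh are an assumption here, since the announcement above does not name the API, so check them against the user guide. The function names are illustrative.

```shell
# Assumption: the on-demand refresh is exposed as `start-insights-refresh`
# and its progress as `describe-insights-refresh`; verify in the user guide.

# $1 = cluster name. Requests a new insights evaluation for the cluster.
refresh_insights() {
  aws eks start-insights-refresh --cluster-name "$1"
}

# $1 = cluster name. Shows the state of the most recent refresh request.
insights_refresh_status() {
  aws eks describe-insights-refresh --cluster-name "$1"
}
```

A typical flow would be to remediate a finding, call refresh_insights my-cluster, and then check the status before listing insights again.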

amedirr avatar Aug 28 '25 02:08 amedirr