
Enhanced AMI lookup logic for EKS

Open rudoi opened this issue 5 years ago • 16 comments

/kind feature

In a sentence

CAPA should look up EKS AMIs for me automatically, but it should not allow a single AWSMachineTemplate or AWSMachinePool to front multiple AMIs.

note: AWSMachinePool PR https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1860 is still in flight at this time.

Summary

CAPA today looks up AMI IDs before the RunInstances call if there is no specific AMI hardcoded into the spec of the AWSMachine or AWSMachinePool object. With kubeadm-based clusters, the CAPI group is in charge of how and when new AMIs get published. We have an implicit assumption that asking for 1.16.8 will return the same AMI every time, because we published that AMI and we don't publish new AMIs for the same version.

EKS introduces some very interesting new behaviors:

  1. The Kubernetes patch version is upgraded automatically behind the scenes for the control plane
  2. EKS clusters are created at and upgraded to minor versions only (I cannot create a 1.16.8 cluster specifically, and I cannot tell EKS to upgrade my cluster to 1.16.8 specifically)
  3. There may be multiple AMIs published for a single patch version - as a concrete example, there are two 1.16.8 AMIs with different Docker versions

There is a PR open today that looks up the official EKS AMI in the most basic way: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1817. The official AWS documentation says to use an SSM parameter lookup:

aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.16/amazon-linux-2/recommended/image_id --query "Parameter.Value" --output text

This is weird, because across the lifecycle of a MachineDeployment, you may end up with multiple patch versions represented among its Machines. Not only that, each patch version may be represented by multiple AMIs, depending on the time of Machine creation. But - this will map nicely to the control plane's automatic patch version upgrades, so there is a positive here that may be worth capturing.

I think, while automatic version upgrades are helpful, having AMIs swap in and out from under you can be really tough from a reliability perspective, especially if we don't at least capture when it happens.

Because of all of the above, I think CAPA should play a larger role in AMI selection.

Assorted thoughts

  • AWSMachineTemplate is currently "immutable" but that doesn't stop CAPA from resolving a new AMI ID each time the template is copied
  • How does Machine/MachineDeployment "version" map into this if EKS doesn't care about patch versions?
  • You can actually look up all the AMIs for a given minor version:
aws ssm get-parameters-by-path --path /aws/service/eks/optimized-ami/1.16/amazon-linux-2 --region us-east-2 --output json

This output does contain references to patch versions. Is this something we should be parsing?
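
For illustration, the same lookup can be piped through jq to see which parameter names and values carry patch-version references; the --recursive flag and the exact shape of the parameter names are assumptions here and would need checking against the real SSM hierarchy:

# illustrative only - print each parameter name alongside its value
aws ssm get-parameters-by-path \
  --path /aws/service/eks/optimized-ami/1.16/amazon-linux-2 \
  --recursive --region us-east-2 --output json \
  | jq -r '.Parameters[] | "\(.Name) \(.Value)"'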

cc @richardcase @randomvariable

rudoi · Aug 13 '20

We have an implicit assumption that asking for 1.16.8 will return the same AMI every time because we published that AMI and we don't publish new AMIs on the same version, per se.

I think that's been more out of laziness than an explicit decision, and we could change that when we move the AMI generation from ex-Heptio to CNCF infrastructure.

I think a typical use case would be something like:

  • A CVE is announced for a component of the image that isn't Kubernetes, e.g. containerd
  • Either:
    • A
      • Run your image build pipeline to create a new AMI with the updated non-k8s components
      • Roll your infrastructure over
    • B
      • In-place upgrade

I think we can discount multiple AMIs per Kubernetes release being a problem in itself, but we do want a mechanism to deal with it and to enable the infrastructure rollover easily.

The change in Kubernetes patch version is definitely an issue as far as the API goes though.

This output does contain references to patch versions. Is this something we should be parsing?

I think so, in the sense that you'd want to match the EKS control plane version?

So maybe there is a case to be made for the AMI to be resolved in the AWSMachineTemplate according to the following filters:

  • Kubernetes version
  • Build revision, if specified, defaulting to newest

https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/138#issuecomment-425997970 was what I originally proposed for the AMI naming convention; I think revisiting the semver format (inclusion of a timestamp) may be worthwhile.

This way, if you need to roll out a new AMI in a hurry, you can do so and use MachineDeployments with separate AWSMachineTemplates to do the rollout. A rough sketch of what such a lookup could look like is below.
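
Purely as a sketch of that filter order (Kubernetes version first, then build revision if specified, otherwise newest), a lookup along these lines could work against SSM; the parameter naming, the --recursive flag, and the lexical "newest wins" sort are assumptions, not the actual CAPA logic:

# hypothetical sketch - resolve the newest image_id parameter matching a version
K8S_VERSION=1.16.8
aws ssm get-parameters-by-path \
  --path /aws/service/eks/optimized-ami/1.16/amazon-linux-2 \
  --recursive --output json \
  | jq -r '.Parameters[] | select(.Name | endswith("image_id")) | .Name' \
  | grep -F "$K8S_VERSION" \
  | sort | tail -n 1
# a pinned build revision would simply be an extra filter before picking the newest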

randomvariable · Aug 14 '20

I definitely like the idea of resolving the AMI ID on the AWSMachineTemplate. Resolving the AMI ID as part of each AWSMachine is what's going to put us in a mixed-AMI scenario.

I am curious what we should do about the patch versions for EKS. For one, aws eks describe-cluster doesn't return the patch version of a cluster. I think we can probably get away with doing a more in-depth SSM lookup to filter for a patch version, but then you are definitely going out of sync with the control plane, which is going to upgrade on its own no matter what. 🤔

rudoi · Aug 14 '20

Is there anything on the cluster that tells you the patch version? Or maybe a separate loop that iterates over templates associated with an EKS control plane and adds an EOUTOFDATE condition if it mismatches the control plane.

randomvariable · Aug 17 '20

I suppose there's always the /version endpoint...
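
For example (illustrative; the exact version string will vary by cluster), the API server's /version endpoint does report the full patch version even though aws eks describe-cluster only reports the minor:

# illustrative - read the server's reported gitVersion, which includes the patch
kubectl get --raw /version | jq -r .gitVersion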

rudoi · Aug 17 '20

There's potentially value in marking out-of-date deployments in core CAPI, so maybe the first step is to finalise resolution in AWSMachineTemplate?

randomvariable · Aug 17 '20

@randomvariable agreed, I think that's the right way to get this started.

rudoi · Aug 17 '20

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot · Feb 01 '21

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

fejta-bot · Mar 03 '21

/lifecycle frozen

richardcase · Mar 03 '21

We do some lookup of default AMIs on the EKS side. However, we don't look up the GPU-optimized AMIs (see #2275).

richardcase · Mar 03 '21

/help

richardcase · Mar 03 '21

@richardcase: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Mar 03 '21

/assign

sedefsavas · Jun 28 '21

/milestone v0.7.x

sedefsavas · Jul 28 '21

/priority important-longterm

randomvariable · Nov 08 '21

/remove-lifecycle frozen

richardcase · Jul 08 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Oct 23 '22