SPIKE: Patching ECS/EKS nodes

Open davidkelliott opened this issue 2 years ago • 3 comments

User Story

As a modernisation platform engineer
I want customers to use the most recent AMIs with their clusters
So that they are using up-to-date software

User Type(s)

  • Analytical Platform users
  • Data Platform users
  • Performance Monitoring
  • Other potential platform customers on MP

Value

Where ECS or EKS use EC2 instances, we need to ensure that they are using the latest recommended versions. We will start with investigating how we find out the latest versions and make users aware of this, then how we make these upgrades at a platform level if needed.

Questions / Assumptions / Hypothesis

Has this already been covered with the new ECS module raised after this issue was created? If so, is it just a question of migrating legacy users across?

Proposal

This story is about finding out where customers are not using up-to-date AMIs for ECS/EKS - for example, where they're hard-coding the AMI rather than retrieving the latest version with a data call. It's a bit more free-form than that because this is a spike, but that's my interpretation.

Definition of done

  • [ ] identify options for keeping ECS/EKS users up-to-date
  • [ ] present options to team
  • [ ] define how we would do this on an ongoing basis
  • [ ] raise follow-up issues as necessary

Reference

How to write good user stories

davidkelliott avatar Oct 14 '22 09:10 davidkelliott

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] avatar Jan 13 '23 01:01 github-actions[bot]

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] avatar Dec 13 '23 01:12 github-actions[bot]

Following some discussion, we think this story is around the use of up-to-date AMI images for ECS/EKS containers.

dms1981 avatar May 09 '24 09:05 dms1981

I've gone through the code in Modernisation Platform Environments and created a spreadsheet to document the use of EKS/ECS, noting where hard-coded AMI values are being used.

ECS

  • Currently 7/17 environments host their ECS services on EC2 instances with hard-coded AMI IDs.
  • The rest use Fargate (serverless) instances which are managed by AWS so require no maintenance.
  • The https://github.com/ministryofjustice/modernisation-platform-terraform-ecs-cluster module defaults to a serverless approach but is currently only used by delius apps.

EKS

  • AP and DP are using EKS clusters which are built using the community Terraform EKS module
  • I spoke to Jacob who confirmed that they patch their AMIs on a weekly basis. They don't pin to a specific AMI ID but they do pin to an eks_node_version which is optimised to get the latest available AMI for the current K8s version of EKS cluster.
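The search for hard-coded AMI values described above could be partly automated. The following is a minimal sketch, not the method actually used for the spreadsheet: it assumes AMI IDs appear as literals in `.tf` and `.tfvars` files, and the function names and suffix list are hypothetical.

```python
import re
from pathlib import Path

# AMI IDs are "ami-" followed by 17 (current) or 8 (legacy) hexadecimal characters.
AMI_ID_PATTERN = re.compile(r"\bami-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")

# Assumption: only these file types are worth scanning; extend as needed.
SCANNED_SUFFIXES = {".tf", ".tfvars"}


def find_hardcoded_amis(text: str) -> list[tuple[int, str]]:
    """Return (line_number, ami_id) pairs for every AMI ID literal in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Skip comment-only lines so example IDs in comments are not reported.
        if line.lstrip().startswith(("#", "//")):
            continue
        for match in AMI_ID_PATTERN.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan files under root and return only the files that contain hits."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SCANNED_SUFFIXES:
            hits = find_hardcoded_amis(path.read_text(encoding="utf-8", errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report
```

Calling `scan_tree` on a checkout of the environments repository would give a per-file list of candidate lines to review by hand. A hit is only a candidate: an ID in a default value or a test fixture may be legitimate.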

richgreen-moj avatar Jun 04 '24 12:06 richgreen-moj

Here's a blog with some template code for automating the update of EC2 instances in an auto scaling group that is hosting ECS services https://aws.amazon.com/blogs/industries/automate-patching-by-replacing-amazon-ecs-container-instances/ Essentially it looks up the latest version of the ECS-optimised AMI for your desired platform and then updates the launch template with the new value. Care is taken to drain nodes and take them offline one by one to avoid downtime.
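The update step can be sketched roughly as follows. This is not the blog's template. It is a simplified illustration that assumes the auto scaling group references its launch template at version `$Latest`, and that `ec2` and `autoscaling` are boto3 clients passed in by the caller. The function name `roll_out_ami` is hypothetical, and draining of ECS tasks is not handled here.

```python
def roll_out_ami(ec2, autoscaling, launch_template_id: str, asg_name: str, new_ami_id: str) -> str:
    """Create a launch template version with new_ami_id and start a rolling refresh.

    ec2 and autoscaling are boto3 clients (or stand-ins with the same methods).
    Returns the ID of the instance refresh that was started.
    """
    # Copy the latest template version, overriding only the image ID.
    ec2.create_launch_template_version(
        LaunchTemplateId=launch_template_id,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": new_ami_id},
    )
    # Replace instances gradually. The percentage and warm-up are illustrative
    # values, not recommendations; tune them per workload.
    response = autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Strategy="Rolling",
        Preferences={"MinHealthyPercentage": 90, "InstanceWarmup": 300},
    )
    return response["InstanceRefreshId"]
```

An instance refresh replaces EC2 instances but does not by itself move ECS tasks off a node first, so a real rollout would still need the draining step the blog describes.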

richgreen-moj avatar Jun 04 '24 13:06 richgreen-moj

Retrieving latest AMIs:

ECS

The ECS TF module uses a data call to retrieve the latest ECS-optimised AMI image by querying the Systems Manager Parameter Store API. https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/examples/ec2-autoscaling/main.tf#L162C3-L165

This is then used to set the image ID for the ECS auto scaling group https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/examples/ec2-autoscaling/main.tf#L296

Members could make use of this module or build this in to their code, rather than hard-coding AMI IDs.

Or via SSM parameter store: aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/recommended --region eu-west-2

EKS

The EKS TF Module can be used with a data call to get, for instance, the latest bottlerocket EKS-optimised image: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/eks_managed_node_group/main.tf#L527-L535

Or via SSM parameter store: aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.30/x86_64/latest/image_id --region eu-west-2 --query "Parameter.Value" --output text
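The same two lookups can be done from a script. This is a minimal sketch that assumes `ssm` is a boto3 SSM client passed in by the caller; the helper name `latest_ami_id` and the constant names are hypothetical. The ECS `recommended` parameter holds a JSON document that includes an `image_id` field, whereas the Bottlerocket `image_id` parameter holds the bare ID, so the helper handles both shapes.

```python
import json

# Parameter names taken from the CLI commands above.
ECS_RECOMMENDED = "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended"
BOTTLEROCKET_IMAGE_ID = "/aws/service/bottlerocket/aws-k8s-1.30/x86_64/latest/image_id"


def latest_ami_id(ssm, parameter_name: str) -> str:
    """Return the AMI ID published under a public SSM parameter.

    ssm is a boto3 SSM client (or a stand-in exposing get_parameter).
    """
    value = ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]
    # JSON-valued parameters (such as the ECS "recommended" one) carry the ID
    # in an "image_id" field; ".../image_id" parameters are the bare ID.
    if value.lstrip().startswith("{"):
        return json.loads(value)["image_id"]
    return value.strip()
```

With boto3 installed, `latest_ami_id(boto3.client("ssm", region_name="eu-west-2"), ECS_RECOMMENDED)` would return the current ECS-optimised AMI ID for that region.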

richgreen-moj avatar Jun 04 '24 13:06 richgreen-moj

Based on my findings on the usage of ECS and EKS across the MP, here is a list of options that members could consider to ensure their infrastructure is patched with the latest AMIs:

Options

  1. Use Fargate (serverless) approach so that instance patch management is managed by AWS. (Use the MP module for this)
  • Pros:
    • Easier to maintain (reduces engineer burden)
    • Costs could be less if well optimised
  • Cons:
    • Less customisable - fewer choices of hardware etc.
  2. Use a Terraform data call to retrieve the latest ECS/EKS-optimised AMI image by querying the Systems Manager Parameter Store API (e.g. https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/examples/ec2-autoscaling/main.tf#L162C3-L165)
  • Pros:
    • AMIs will be updated to latest versions/patches as they become available when you run your IaC deployments
  • Cons:
    • Care needs to be taken over roll out of the changes into production to ensure no downtime
  3. Reconsider whether workloads would be appropriate for Cloud Platform
  • Pros:
    • Easier to maintain for application owners (CP manage patching)
    • Lower costs?
  • Cons:
    • There may still be valid reasons why these workloads need to be hosted in MP
  4. Make users aware of the latest AMIs as they are released via an updates channel in Slack?
  • Pros:
    • No immediate code changes required for members
    • Members notified when new AMIs are available
  • Cons:
    • Burden on MP to manage a service that notifies users
    • No guarantee users will update their infrastructure as a result of the alerts.

My Recommendation:

Raise a ticket to explore whether options 1/2/3 would be suitable for all of the applications I've identified that are running ECS/EKS with pinned AMI IDs in their code...

  • analytical-platform-compute
  • data-platform-apps-and-tools
  • apex
  • cdpt-ifs
  • cdpt-chaps
  • mlra
  • performance-hub
  • maat
  • tribunals

richgreen-moj avatar Jun 04 '24 14:06 richgreen-moj

@sukeshreddyg suggested that we could write a Lambda function that scans the AMIs in use by clusters in member accounts and compares them with the latest versions, so that we can alert the MP team when they are out of date. I will draft a story to explore this further.
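As a starting point for that story, the core of such a scan might look like the sketch below. It is an illustration under assumptions, not an agreed design. It covers ECS clusters only, takes boto3 `ecs` and `ec2` clients as parameters, ignores pagination (so it suits small clusters only), and treats any AMI ID that differs from the latest one as out of date. The function names are hypothetical.

```python
def amis_in_use(ecs, ec2, cluster: str) -> dict[str, str]:
    """Map EC2 instance ID to AMI ID for the container instances of one ECS cluster.

    ecs and ec2 are boto3 clients (or stand-ins with the same methods).
    """
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        # No container instances registered, for example a Fargate-only cluster.
        return {}
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    instance_ids = [ci["ec2InstanceId"] for ci in described["containerInstances"]]
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    return {
        instance["InstanceId"]: instance["ImageId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    }


def outdated_instances(in_use: dict[str, str], latest_ami_id: str) -> list[str]:
    """Return the instance IDs whose AMI differs from latest_ami_id, sorted."""
    return sorted(
        instance_id for instance_id, ami_id in in_use.items() if ami_id != latest_ami_id
    )
```

The equality check is deliberately crude: members who build their own images on top of the ECS-optimised AMI would always be flagged, so a real implementation would probably compare image names or creation dates instead.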

richgreen-moj avatar Jun 07 '24 15:06 richgreen-moj

Stories to write:

  • [x] 1. Contact MP members with hardcoded AMIs to suggest alternative ways to stay up to date - https://github.com/ministryofjustice/modernisation-platform/issues/7188
  • [x] 2. Monitor for outdated ECS/EKS AMIs on the MP and alert the team - https://github.com/ministryofjustice/modernisation-platform/issues/7189

richgreen-moj avatar Jun 07 '24 15:06 richgreen-moj