Issues with "software version consistency" feature
EDIT: this is related to the "software version consistency" feature launch, see What's New post: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/
Summary
Since our EC2 instances upgraded to ecs-agent v1.83.0, the images used for containers are reported with a SHA digest instead of an image tag.
Description
We started getting a different image value for the '{{.Config.Image}}' property when using docker inspect on our ECS EC2 instances: we now get the SHA digest as .Config.Image instead of the image tag. The task definition contains the correct image tag (not the digest).
We need the image tag because we rely on that custom tag to understand what was deployed. What can be done?
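For context, this is roughly the check we run on the container instance (container name, registry, and digest below are placeholders, not our real values):

```bash
# Run on the ECS container instance; container name and registry are placeholders.
docker inspect --format '{{.Config.Image}}' my-app-container

# Before ecs-agent v1.83.0 this printed the image reference with our custom tag:
#   012345678910.dkr.ecr.us-east-1.amazonaws.com/my-repo:my-custom-tag
# Since the upgrade it prints a digest reference instead:
#   012345678910.dkr.ecr.us-east-1.amazonaws.com/my-repo@sha256:3f1c9e...
```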
Expected Behavior
We expect to see the image tag used for the container.
Observed Behavior
We get the image digest used for the container.
Environment Details
- `curl http://localhost:51678/v1/metadata` → `{"Cluster":"xxxxr","ContainerInstanceArn":"xxx","Version":"Amazon ECS Agent - v1.83.0 (*xxx"}`
same
FWIW, today I encountered a production incident, after updating to ecs-agent 1.83.0 roughly 2 weeks ago, where I saw a subset of our ECS tasks fail to start with:
CannotPullContainerError: failed to tag image '012345678910.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>@sha256:<digest>'
This was a surprising error to see, given that the only change on our side we can attribute it to is the agent version upgrade 🤷, and it feels similar enough to be worth a mention given the digest in the error message.
This seemed to be isolated to a small fraction of our cluster instances (all running 1.83.0), and tasks from the same task revisions that had yielded the error eventually phased in without intervention.
I've also happened to notice that https://github.com/aws/amazon-ecs-agent/pull/4181 intends to help augment these kinds of errors with some more useful context and made it into agent release 1.84.0 so I'll report back if/when we upgrade and whether or not that yields anything of use 👍
EDIT: didn't touch the 1.84.0 upgrade after seeing this comment
This has also caused production issues for my org. We use the ImageName value available in the ECS container metadata file at runtime, as we tag our ECR images with the Git commit SHA. This is then used for a variety of things in different services, such as sourcing assets, tracking deploys, etc.
Since 1.83.0, ImageName is sometimes populated with the SHA digest, which we expected to appear in ImageID and not ImageName.
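To illustrate what we see (values are made-up placeholders): the container metadata file is enabled via ECS_ENABLE_CONTAINER_METADATA=true on the instance and located through the standard ECS_CONTAINER_METADATA_FILE environment variable inside the container:

```bash
# jq is only used here for readability; field values below are made up.
jq '{ImageName, ImageID}' "$ECS_CONTAINER_METADATA_FILE"

# What we relied on (ImageName carries our Git commit SHA tag):
#   { "ImageName": "012345678910.dkr.ecr.us-east-1.amazonaws.com/my-app:abc1234",
#     "ImageID": "sha256:9d5e..." }
# What we sometimes get since 1.83.0 (a digest where the tag used to be):
#   { "ImageName": "012345678910.dkr.ecr.us-east-1.amazonaws.com/my-app@sha256:1a2b...",
#     "ImageID": "sha256:9d5e..." }
```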
I still found this error on ecs-agent 1.84.0.
We have production issues with the change too, when a tag is re-used for a new image and the old image is deleted.
I'm also seeing the issue where a newly pushed and tagged "latest" image is ignored and the agent only uses the older, now-untagged image. This needs to be fixed ASAP, or at least give us a workaround. I'm seeing this behavior on agent 1.83.0; it was not happening on 1.82.1.
We are also seeing this issue in our environment. ~~It doesn't seem to happen with all images.~~ FWIW, on the same container instance, we can see some containers with tags and others without, and when a container does have a tag, it's the first container launched.
FWIW, this also impacts the ECS APIs, specifically describe-tasks
https://www.reddit.com/r/aws/comments/1dtgc4b/mismatching_image_uris_tag_vs_sha256_in_listtasks/
Unclear if the source of truth (and the root cause) is the agent or the APIs themselves, but I thought it's worth noting this.
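A quick way to see it on the API side (cluster name and task ARN below are placeholders; assumes the AWS CLI and jq are available):

```bash
# Compare what the task is actually running against the tag in its task definition.
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:012345678910:task/my-cluster/0123456789abcdef \
  | jq -r '.tasks[].containers[] | "\(.name)  \(.image)"'

# The task definition still references ...:my-tag, while describe-tasks can return
# ...@sha256:<digest> for the same container.
```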
Found this issue after an internal investigation of an incident that seems likely related to this. In case it helps anyone else, here's my analysis of how this impacted a service that referenced an ECR image by a persistent image tag which we regularly rebuilt and overwrote, with automation in place to delete the older untagged images.
I have an open support case with AWS to confirm this behaviour, and have included a link to this github issue.
```mermaid
sequenceDiagram
participant jenkins as Jenkins
participant cloudformation as Cloudformation
participant ecs-service as ECS Service
participant ec2-instances as EC2 Instances
participant ecr-registry as ECR Registry
participant docker-base-images as Docker Base Images<br />firelens sidecar image
participant ecr-lifecycle-policy as ECR Lifecycle Policy
jenkins ->> cloudformation: regular deployment
cloudformation ->> ecs-service: creates a new "deployment" for the service
activate ecs-service
note right of ecs-service: ECS resolves the image hash<br />at time of "deployment" creation
ecs-service ->> ec2-instances: starts tasks with resolved image hashes
ec2-instances ->> ecr-registry: pulls latest image from ECR
docker-base-images ->> ecr-registry: rebuild and push image regularly
ecr-lifecycle-policy ->> ecr-registry: deletes older images periodically
note right of ecs-service: periodically, new tasks need to start
ecs-service ->> ec2-instances: starts tasks with previously resolved image hashes
ec2-instances ->> ecr-registry: attempts to run the same image hash from earlier<br />if the image already exists on the instance, it's fine<br />otherwise, it needs to pull from ECR again and may fail
ec2-instances ->> ecs-service: tasks fail to launch due to missing image
note right of ecs-service: at this point, the service is unstable<br />might have existing running tasks<br /> but it can't launch new ones
create actor incident as Incident responders
ecs-service ->> incident: begin investigation
note left of incident: "didn't this happen the other day<br />for another service?" *checks slack*
note left of incident: Yeah, it did happen, and the outcome<br />was that we disabled the ECR lifecycle<br />policy, but services were left with<br />the potential to fail when tasks cycle
incident ->> jenkins: trigger replay of latest production deployment early and hope that fixes the issue
jenkins ->> cloudformation: deploy
cloudformation ->> incident: "there are no changes in the template"
incident ->> jenkins: disable the sidecar to get the service up and running again quickly and buy more time for investigation
jenkins ->> cloudformation: deploy with sidecar disabled
deactivate ecs-service
cloudformation ->> ecs-service: create new deployment without sidecar
activate ecs-service
note right of ecs-service: no longer cares about firelens sidecar image
ecs-service ->> ec2-instances: starts new tasks
ec2-instances ->> ecs-service: success
ecs-service ->> incident: service is up and running again, everyone is happy
note left of incident: "but we're not done yet"
incident ->> jenkins: re-enable the sidecar
jenkins ->> cloudformation: deploy with sidecar enabled
deactivate ecs-service
cloudformation ->> ecs-service: create new deployment with sidecar
activate ecs-service
note right of ecs-service: ECS resolves the image hash<br />at time of "deployment" creation
ecs-service ->> ec2-instances: start new tasks
ec2-instances ->> ecr-registry: pulls new images with updated hash
ec2-instances ->> ecs-service: success
ecs-service ->> incident: service is stable again
note left of incident: This service looks good again now<br />but other services might still have a problem
deactivate ecs-service
incident ->> ecs-service: work through "Force New Deployment" for all services in all ecs clusters & accounts
note left of incident: all services are now expected to be<br />stable, as everything should be<br />referencing the latest firelens image<br />hash, and the lifecycle policy<br />to delete older ones is disabled
```
This issue most probably comes from aws/amazon-ecs-agent#4177 merged in 1.83.0:
Expedited reporting of container image manifest digests to ECS backend. This change makes Agent resolve container image manifest digests for container images prior to image pulls by either calling image registries or inspecting local images depending on the host state and Agent configuration. Resolved digests will be reported to ECS backend using an additional SubmitTaskStateChange API call
Downgrading to 1.82.4 in our case does not make the issue go away, indicating that, even if it was related to the agent, the digest information is now somehow cached by ECS. We are currently using a DAEMON ECS service.
According to a recent case opened with AWS support, "ECS now tracks the digest of each image for every service deployment of an ECS service revision. This allows ECS to ensure that for every task used in the service, either in the initial deployment, or later as part of a scale-up operation, the exact same set of container images are used." They added that this is part of a rollout that started in the last few days of June and is supposed to complete by Monday.
Their suggested solution is to update the ECS service with "Force new deployment" to "invalidate" the cache. If you have AWS support, try to open a case including this information to see how they evaluate your issue.
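For reference, their workaround boils down to one call per affected service (cluster and service names below are placeholders):

```bash
# Creates a new deployment for the service, which makes ECS re-resolve image digests.
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --force-new-deployment
```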
I got a response similar to @sjmisterm's in my support case, confirming the new behaviour is expected and stating that we should no longer delete images from ECR until we're certain they are no longer in use by any deployment.
This change effectively means that ECR lifecycle policies which delete untagged images are expected to cause outages, unless additional steps are taken immediately after every image deletion to ensure that every deployment referencing a mutable tag is redeployed. This is particularly problematic for my specific use case, where we reference a mutable tag for a sidecar container that we include in many services.
I've asked if there is any future roadmap plans to make this use-case easier to manage, and requested for a comment from AWS on this github issue 😄
... https://xkcd.com/1172/
AWS has confirmed this is definitely caused by them, and they consider it a good feature, as the links (made available yesterday) show:
- https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/
- https://aws.amazon.com/blogs/containers/announcing-software-version-consistency-for-amazon-ecs-services/
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html#deployment-container-image-stability
There's no way to turn off this new behaviour, which completely breaks the easiest workflow for blue-green deployments. I'm sure tons of people have other cases that need or benefit from the old behaviour.
I suggest that everyone who has AWS support file a case and request an API to turn this off by service / cluster / account.
Hello, I am from the AWS ECS Agent team.
As shared by @sjmisterm above, the behavior change that customers are seeing is because of the recently released Software Version Consistency feature. The feature guarantees that the same images are used for a service deployment by recording the image manifest digests reported by the first launched task and then overriding tags with digests for all subsequent tasks of that service deployment.
Currently there is no way to turn off this feature. ECS Agent v1.83.0 included a change to expedite the reporting of image manifest digests, but older Agent versions also report digests, and the ECS backend will override tags with digests in both cases. We are actively working on solutions to fix the regressions our customers are facing due to this feature.
One of the patches we are considering is - instead of overriding :tag with @sha256:digest, we would override it with :tag@sha256:digest so that the lost information is added back to the image references.
@amogh09 , I can't see how this would address the blue-green scenario. Could you explain it, please?
> There's no way to turn off this new behaviour, which completely breaks the easiest workflow for blue-green deployments
@sjmisterm Can you please share more details on how this change is breaking blue-green deployments for you?
@amogh09 , sure.
Our blue-green deployments work by pushing a new image to the ECR repo tagged latest and then launching a new EC2 instance (from the ECS-optimized image, properly configured for the cluster) while we make sure the new version works as expected in production. Then we progressively drain the old tasks until only new tasks remain.
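Roughly, the flow looks like this (a simplified sketch with placeholder names, not our actual deployment scripts; assumes the instances come from an Auto Scaling group):

```bash
# 1. Push the new build to the mutable tag (registry/repo are placeholders).
docker push 012345678910.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

# 2. Bring up a fresh, properly configured container instance in the cluster.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-ecs-asg \
  --desired-capacity 4

# 3. Once the new tasks check out in production, drain the old instances so their
#    tasks are progressively replaced.
aws ecs update-container-instances-state \
  --cluster my-cluster \
  --container-instances arn:aws:ecs:us-east-1:012345678910:container-instance/my-cluster/0123456789abcdef \
  --status DRAINING
```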
@amogh09 in summary: the software version "inconsistency" is what makes blue green a breeze with ECS. Should we want consistency, we'd use a digest or a version tag.
@sjmisterm The deployment unit for an ECS service is a TaskSet. The Software Version Consistency feature guarantees image consistency at the TaskSet level. In your case, how do you get a new task placed on the new EC2 instance? The new task needs to be part of a new TaskSet to get the newer image version; if it belongs to the existing TaskSet, it will use the same image version as its TaskSet.
ECS supports blue-green deployments natively at the service level if the service is behind an Application Load Balancer. You can also use the External deployment type for even greater control over the deployment process. The Software Version Consistency feature is compatible with both of these.
@amogh09 I use a network load balancer and the LDAP container instances I'm running will not respond well to this new model. If I can't maintain the ability to pull the tagged latest image, I will have to stop using ECS and manage my own EC2s, which would be painful frankly.
Looking at the ECS API, what would happen if I called DeregisterTaskDefinition and then RegisterTaskDefinition? Would that have the effect of forcing ECS to resolve the digest from the new latest image without killing the running tasks?
@amogh09 , I think we're talking about different things. Until the ECS change, properly launching a new ECS instance configured for an ECS daemon service whose task definition is tagged with :latest would launch the new task with, well, the image tagged latest. Now it launches using the digest resolved by the first task unless you force a new deployment of your service.
Our deployment scripts pre-date CodeDeploy and the other features, so all your suggestions require rewriting deployment code because of a feature we can't simply opt out of.
I understand the frustrations you all are sharing regarding this change. I request you to contact AWS Support for your issues. Our support team will be able to assist you with workarounds relevant to your specific setups.
@amogh09 , a simple API flag at the service / cluster / region / account level would solve the problem. That's what we're trying to get across, because this change disturbs your customer base: not everyone pays for support, and the old behaviour, as you can see, is relied on by several of them.
I'll chime in that we were negatively impacted by this change as well, and I don't think it helps anything for most scenarios.
Before, customers effectively had a choice: they could either enforce software version consistency by using immutable tags (https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html), or if they wanted to allow for a rolling release (most useful for daemon services as @sjmisterm alluded to) they could achieve that as well by using a mutable tag.
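To make that concrete, the old "choice" was just a per-repository setting (repository name below is a placeholder):

```bash
# Enforce consistency by rejecting tag overwrites...
aws ecr put-image-tag-mutability \
  --repository-name my-app \
  --image-tag-mutability IMMUTABLE

# ...or keep tags mutable (the default) and let a rolling tag such as :latest float.
aws ecr put-image-tag-mutability \
  --repository-name my-app \
  --image-tag-mutability MUTABLE
```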
Now, this option is gone with nothing to replace it, and very poor notification that it was going to happen to boot.
I'm very disappointed with AWS on two counts:
- From a technical standpoint, it appears that no consideration was given to how customers are actually using ECS.
- The lack of prior communication for a change like this is shocking. I've seen AWS announce long lead times for far less impactful changes than this.
I know that the circumstances around how we all got notified about this change aren't ideal, but is there anywhere where we can be proactive and follow along for similar updates that may affect us in the future? Did folks get a mention from their AWS technical account managers or similar?
I lurk around the containers roadmap fairly often, but don't see an issue/mention there or in any other publicly-facing aws github project around this feature release.
@scott-vh the problem is that this is an internal API change; the ECS backend behaves differently now. This has nothing to do with ecs-agent itself: regardless of agent version you will get the same behaviour. No one could have seen it coming.
@dg-nvm Yep, I got that 👍 I was just curious if there was any breadcrumb anywhere else through which we could've seen this coming (sounds like no, but I wanted to see if anyone who interfaces with TAMs or ECS engineers through higher tiers of support got some notice).
@scott-vh our TAM was informed about the problem, but I don't know if there was any proposal. Given that I see ideas for workarounds accumulating, I would say no :D Luckily our CD was not impacted by this. I can think of scenarios where daemon deployments are easier using mutable tags, especially since ECS does not play nicely when replacing daemons. Sometimes they get stuck because they were removed from the host and something else was put in their place in the meantime :)