[ECS] [Container image resolution]: Allow feature to be disabled (or make it opt-in)

Open jakauppila opened this issue 1 year ago • 9 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

It was announced on 7/11/2024 that, for any Amazon ECS service created or updated after June 25, 2024, container image tags would be resolved to the image digest, and that the digest would be used going forward to ensure software version consistency.

This change in behavior was not communicated, was not opt-in, and was not even gated behind a new Fargate platform version.

We relied on the previous behavior: application-defined Task Definitions pointed at centrally managed sidecar images with mutable tags, so that when a new version was pushed, every consuming task definition would start using it immediately, without requiring a deployment by hundreds or thousands of applications.

Which service(s) is this request for?

ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We were leveraging the previous ability to point at mutable container image tags to roll out centrally managed sidecars without action needed by our application developer customers.

Are you currently working around this issue?

To resolve the problem of failing applications, we had to restore the old container images to ECR with the SHA that the tag was previously resolved to; historically we have purged the old image when we push the new one.

Additional context

  • What's New: https://aws.amazon.com/about-aws/whats-new/2024/07/amazon-ecs-software-version-consistency-containerized-applications/
  • Blog Post: https://aws.amazon.com/blogs/containers/announcing-software-version-consistency-for-amazon-ecs-services/
  • Documentation: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html#deployment-container-image-stability

jakauppila avatar Jul 18 '24 16:07 jakauppila

Hi,

I'd like to emphasize the importance of the requested feature to disable the new ECS image tag resolution behavior. This change has disrupted our deployment strategy, which relies on using the latest tag for blue-green and rolling updates.

The flexibility of using mutable tags allowed us to manage deployments without extra steps. This ECS change has increased our operational overhead, requiring additional deployment steps for every update.

I'd like to request an option to disable this new functionality at the service, cluster, or account level, allowing us to maintain our current deployment process.

danielferraz-git avatar Jul 24 '24 12:07 danielferraz-git

For now, I believe the suggested workaround should be officially documented: #2402.

danielferraz-git avatar Jul 25 '24 14:07 danielferraz-git

Great! It's good to know that.

DevAssis avatar Jul 25 '24 16:07 DevAssis

Got caught by this today when one of our tasks needed to restart due to memory overload and was eventually killed because the restart was unable to download a Datadog sidecar image that we reference in the TaskDefinition by floating tag. The tag had moved to a new version and the old image had been purged (we host copies of Datadog in our own ECR). I hate this new feature - I'm competent enough to use unique buildserver-assigned tags for containers that count, but when I decide to use a floating tag (e.g. based on SemVer) I understand that I may have small internal inconsistencies, and I accept them. At the very least, allow us to override this new default...

pmcevoy avatar Aug 07 '24 11:08 pmcevoy

Cross-posting the message I posted on issue #2394.

Sorry for the late response on this thread. We're aware of the impact this change has had and apologize for the churn this rollout has created. We've been actively working through the set of issues that have been highlighted on this thread and have 2 updates to share:

1/ For customers who've been impacted by the lack of ability to see image tag information, we're working on a change that will bring back image tag information in the describe-tasks response, in the same format as was available prior to the release of version consistency (i.e. image:tag). An important thing to keep in mind here is that if you run docker ps on the host, you will see the image in format image:tag but docker inspect will return image:tag@digest.

2/ We're also working on adding a configuration in the container definition that will allow you to opt out of digest resolution for specific containers within your task. This should address both customers who want to completely opt out of digest resolution as well as customers who want to disable resolution for specific sidecar containers.

I'll be using this issue to share updates on the change to disable digest resolution for specific containers and issue #2394 for updates on the change to bring back image tag information. We're tracking both changes at high priority.

Once again, we regret the churn this change has caused you all. While we still believe version consistency is the right behavior for the vast majority of applications hosted on ECS, we fully acknowledge that we could have done a better job socializing these changes and addressing these issues before, rather than after, making the change.
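For tooling that has to cope with both formats mentioned above (image:tag and image:tag@digest), a minimal Python sketch of splitting a reference into its parts might look like this. The `ImageRef` type and `parse_image_ref` helper are hypothetical names introduced here for illustration, not part of any AWS or Docker API, and the parsing assumes ordinary references of the form `repository[:tag][@digest]`.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    """Hypothetical container for the parts of an image reference."""
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'repository[:tag][@digest]' into its parts."""
    # The digest, if present, follows the '@' in the reference.
    name, _, digest = ref.partition("@")
    # A ':' is a tag separator only if it comes after the last '/';
    # otherwise it belongs to a registry port such as 'localhost:5000/app'.
    last_slash = name.rfind("/")
    last_colon = name.rfind(":")
    if last_colon > last_slash:
        repository, tag = name[:last_colon], name[last_colon + 1:]
    else:
        repository, tag = name, None
    return ImageRef(repository, tag, digest or None)
```

With a helper like this, a script comparing describe-tasks output against docker inspect output could compare the repository and tag while treating the digest as optional.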

vibhav-ag avatar Aug 07 '24 18:08 vibhav-ag

Could you please provide an estimate for when this work will be complete? I echo the feelings voiced in https://github.com/aws/containers-roadmap/issues/2394 that the "software version consistency" feature wasn't rolled out properly, and should be reverted until this new opt-in process is in place.

matdelong avatar Aug 28 '24 12:08 matdelong

For anyone else who's been suffering downtime thanks to the ECS service regression described in this ticket & #2394, I tried to have support disable it for our accounts but found that did not work: SVC is still pushing services into SERVICE_TASK_START_IMPAIRED if they use things like the Amazon X-Ray, CloudWatch, etc.

I ended up deploying a little bit of EventBridge + Lambda to avoid ECS-triggered downtime. This uses an EventBridge rule to trigger a Lambda for ECR push events on the repositories in question and that Lambda calls ecs:UpdateService for each service using that container to force a new deployment which will resolve the tag to the current digest value. With the various work to manage IAM entities, least-privilege policies, etc. this seems like an unnecessary amount of work simply to get back to the level of reliability which ECS had from its launch until June.
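A rough Python sketch of what such a Lambda could look like follows. It is not the code described in this comment: the repository-to-service mapping, the cluster and service names, and the `services_to_redeploy` helper are hypothetical, and the event fields (`action-type`, `result`, `repository-name`) are assumed to follow the ECR image action events delivered through EventBridge. The only AWS call used is the ECS `update_service` operation with `forceNewDeployment=True`.

```python
# Hypothetical mapping from ECR repository name to the (cluster, service)
# pairs that run an image from it; a real setup would load or discover this.
SERVICES_BY_REPOSITORY = {
    "sidecars/xray-daemon": [
        ("prod-cluster", "orders"),
        ("prod-cluster", "billing"),
    ],
}


def services_to_redeploy(event, mapping=SERVICES_BY_REPOSITORY):
    """Return the services affected by a successful ECR push event."""
    detail = event.get("detail", {})
    # Ignore anything that is not a successful image push.
    if detail.get("action-type") != "PUSH" or detail.get("result") != "SUCCESS":
        return []
    return list(mapping.get(detail.get("repository-name"), []))


def handler(event, context=None, ecs_client=None, mapping=SERVICES_BY_REPOSITORY):
    """Lambda entry point: force a new deployment for each affected service."""
    targets = services_to_redeploy(event, mapping)
    if targets and ecs_client is None:
        import boto3  # imported lazily so the logic can be exercised without AWS
        ecs_client = boto3.client("ecs")
    for cluster, service in targets:
        # A forced deployment makes ECS resolve the tag to its current digest.
        ecs_client.update_service(
            cluster=cluster,
            service=service,
            forceNewDeployment=True,
        )
    return targets
```

The function's execution role would need permission for ecs:UpdateService on the listed services, which is part of the IAM work mentioned above.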

acdha avatar Sep 04 '24 20:09 acdha

I find it incredibly concerning that a forced change that impacts production systems has not been rolled back for 3 months already. Not to mention that this change has been released without any further notice nor is there any workaround (I do not consider "Force new deployment" a workaround as it's good for new deployment, but not the case mentioned over here: https://github.com/aws/containers-roadmap/issues/2394#issuecomment-2305888840 )

rafaljanicki avatar Oct 01 '24 11:10 rafaljanicki

I just got hit by this feature breaking our deployment of a particular service.

This app is not deployed with --force-new-deployment because of other issues with ECS deploys related to long-running processes that take days to exit. Instead, all nodes are marked DRAINING so that new nodes are created with the updated container image. Because the service revision is never updated with the new sha, the new nodes pull down an old container image.

Oddly, there's no way to update the service revision with the new sha without triggering an actual deploy.

I need to opt-out ASAP please.

rs-garrick avatar Nov 07 '24 00:11 rs-garrick

This significant change has been forced without much communication. It should have been an opt-in change rather than opt-out, or at the very least people should have been allowed to opt out of this new behaviour. We have been waiting a few months and would like to know the plan for remediation.

vhadianto avatar Nov 07 '24 03:11 vhadianto

Update 2: you now have the ability to disable consistency for specific containers in your task by configuring the new versionConsistency field for each container in the task definition. Any changes to this property are applied after a deployment. Once again, we regret the churn this change has caused you all.

What’s New Post

vibhav-ag avatar Nov 19 '24 22:11 vibhav-ag

So now I have to go over dozens of task definitions to revert your changes that were enforced on us? Eh, not great

rafaljanicki avatar Nov 20 '24 07:11 rafaljanicki

Within the AWS console when making a task definition revision with JSON, the versionConsistency option is not yet available. When will I be able to update it?

felicienveldema avatar Nov 21 '24 14:11 felicienveldema

Could we get an account (or org) level configuration to set the default value of that option? Then users could decide to disable it by default and opt-in instead of forcing everyone to opt-out.

jakauppila avatar Nov 21 '24 14:11 jakauppila

Hi @felicienveldema -

We are currently in the process of updating the JSON schema used in the console's editor. For now, you can safely ignore the warning and submit the updated JSON directly.

Version Consistency can be turned off for each container by setting its value to disabled like so:

{
    "family": "task-def-name",
    "containerDefinitions": [
        {
             "name": "container-name",
             "image": "image-uri",
+            "versionConsistency": "disabled"
        }
    ]
}

The warning in the editor does not prevent the JSON from being submitted to the API. We are actively working on providing spellcheck and auto-complete support for this new field.
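For anyone who has to apply this to many task definitions, a small Python sketch of the same change is shown here. It only edits a task definition held as a dictionary; the `disable_version_consistency` helper and its `container_names` parameter are hypothetical names for illustration. Registering the result as a new revision and deploying it is left out, and a definition fetched from the API would likely need its read-only fields stripped before it can be registered again.

```python
import copy


def disable_version_consistency(task_definition, container_names=None):
    """Return a copy of the task definition with versionConsistency disabled.

    If container_names is None, every container is changed; otherwise only
    the containers whose "name" is in container_names (e.g. sidecars).
    """
    updated = copy.deepcopy(task_definition)  # leave the caller's dict untouched
    for container in updated.get("containerDefinitions", []):
        if container_names is None or container.get("name") in container_names:
            container["versionConsistency"] = "disabled"
    return updated
```

As noted earlier in this thread, a change to this property only takes effect after a deployment, so each affected service still needs one.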

Thanks, Yurui

pallymore avatar Nov 22 '24 02:11 pallymore

Hi @felicienveldema -

The ECS Console has been updated with the latest schema - you should be able to use the JSON editor language features to configure this field now.

Thanks!

pallymore avatar Nov 26 '24 21:11 pallymore

Update 2: you now have the ability to disable consistency for specific containers in your task by configuring the new versionConsistency field for each container in the task definition. Any changes to this property are applied after a deployment. Once again, we regret the churn this change has caused you all.

Thank you for this - ECS is back to being reliable again, which is a relief after the outages caused by the release of the version consistency feature. However, to echo @jakauppila, it would be useful to have an account-wide way to disable this so we won't have future outages if anyone forgets to disable it in a new task definition.

Since popular AWS services like CloudWatch and X-Ray encourage deployments which software version consistency will turn into outages, that is an ever-present risk. There's no harm in disabling it, since the version consistency feature doesn't add new capabilities which weren't already available.

acdha avatar Dec 13 '24 15:12 acdha

Is there a reason this feature has not been given an account-level setting? Such settings are available for other configurations, including a recent change in default behavior for another ECS setting.

jakauppila avatar Jun 03 '25 16:06 jakauppila