
[Fargate/ECS] [Image caching]: provide image caching for Fargate.

Open matthewcummings opened this issue 4 years ago • 142 comments

EDIT: as @ronkorving mentioned, image caching is available for EC2 backed ECS. I've updated this request to be specifically for Fargate.

What do you want us to build? I've deployed scheduled Fargate tasks and been clobbered with high data transfer fees pulling down the image from ECR. Additionally, configuring a VPC endpoint for ECR is not for the faint of heart. The doc is a bit confusing.

It would be a big improvement if there were a resource (network/host) local to the instance where my containers run which could be used to load my docker images.

Which service(s) is this request for? Fargate and ECR.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I don't want to be charged for pulling a Docker image every time my scheduled Fargate task runs. On that note the VPC endpoint doc should be better too.

Are you currently working around this issue? This was for a personal project, I instead deployed an EC2 instance running a cron job, which is not my preference. I would prefer using Docker and the ECS/Fargate ecosystem.

matthewcummings avatar Jan 14 '20 19:01 matthewcummings

@matthewcummings can you clarify which doc you're talking about ("The doc is horrific")? Can you also clarify which regions your Fargate tasks and your ECR images are in?

jtoberon avatar Jan 15 '20 23:01 jtoberon

@jtoberon can we have these kinds of things in every region? I generally use us-east-1 and us-west-2 these days.

matthewcummings avatar Jan 15 '20 23:01 matthewcummings

It seems better now https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html. It has been updated from what I can see.

However, it still feels like a leaky abstraction. I'd argue that I shouldn't need to know/think about S3 here. Nowhere else in the ECS/EKS/ECR ecosystem do we really see mention of S3.

It would be great if the S3 details could be "abstracted away".

matthewcummings avatar Jan 15 '20 23:01 matthewcummings

Regarding regions, I'm really asking whether you're doing cross-region pulls.

You're right: this is a leaky abstraction. The client (e.g. docker) doesn't care, but from a networking perspective you need to poke a hole to S3 right now.
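To make the "hole to S3" concrete: a task in a private subnet typically needs interface endpoints for the two ECR services plus a gateway endpoint for S3, because that is where the image layers are actually served from. Below is a minimal Python sketch of the request parameters. The helper name, the IDs, and the choice to return plain dicts are illustrative assumptions; the keys are meant to mirror the parameters of boto3's `create_vpc_endpoint`, and the ECR VPC endpoint guide linked above remains the authority on which endpoints a given platform version needs.

```python
def ecr_endpoint_requests(region, vpc_id, subnet_ids, security_group_ids, route_table_ids):
    """Build parameter dicts for the VPC endpoints a private ECR pull relies on.

    Hypothetical helper: it only builds dicts and makes no AWS calls.
    """
    requests = []
    # Interface endpoints for the ECR API and the Docker registry endpoint.
    for service in ("ecr.api", "ecr.dkr"):
        requests.append({
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
            "PrivateDnsEnabled": True,
        })
    # Gateway endpoint for S3, because ECR serves image layers from S3.
    requests.append({
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    })
    return requests
```

Each dict could then be passed to an EC2 client as `ec2.create_vpc_endpoint(**params)`. Even with these in place, every task launch still pulls the full image through the endpoints, which is why caching is a separate ask.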

Regarding making all of this easier, we plan to build cross-region replication, and we plan to simplify the registry URL so that you don't have to think as much about which region you're pulling from. https://github.com/aws/containers-roadmap/issues/140 has more details and some discussion.

jtoberon avatar Jan 16 '20 00:01 jtoberon

Ha ha, thanks. Excuse my snarkiness. . . I am not doing cross-region pulls right now but that is something I may need to do. Thank you!

matthewcummings avatar Jan 16 '20 00:01 matthewcummings

@jtoberon your call on whether this should be a separate request or folded into the other one.

matthewcummings avatar Jan 16 '20 00:01 matthewcummings

Wait, aren't you really asking for ECS_IMAGE_PULL_BEHAVIOR control?

This was added (it seems) to ECS EC2 in 2018: https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-ecs-adds-options-to-speed-up-container-launch-times/

Agent config docs.

I get the impression Fargate does not give control over that, and does not have it set to prefer-cached or once. This is what we really need, isn't it?
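For anyone unfamiliar with the setting: on EC2-backed ECS it is set in the agent configuration (for example `ECS_IMAGE_PULL_BEHAVIOR=prefer-cached` in `/etc/ecs/ecs.config`). The Python sketch below is a simplified model of how I understand the four documented values to decide whether a remote pull is attempted. The function and its arguments are hypothetical, it is not the agent's code, and it ignores what happens when a pull fails.

```python
def should_pull(behavior, cached, pulled_before=False):
    """Return True if a remote pull would be attempted (simplified model).

    behavior: "default", "always", "once", or "prefer-cached".
    cached: whether the image is present on the container instance.
    pulled_before: whether an earlier task on this instance already pulled it.
    """
    if behavior in ("default", "always"):
        # Both attempt a remote pull every time; they differ only in
        # whether a cached image may be used when the pull fails.
        return True
    if behavior == "once":
        # Pull unless a previous task pulled it and it is still cached.
        return not (cached and pulled_before)
    if behavior == "prefer-cached":
        # Pull only when there is no cached copy at all.
        return not cached
    raise ValueError(f"unknown pull behavior: {behavior!r}")
```

Under this model, Fargate behaves like `always` with an empty cache on every launch, so every task start pays for the full pull.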

ronkorving avatar Jan 17 '20 06:01 ronkorving

@ronkorving yes, that's exactly what I've requested. I wasn't aware of the ECS/EC2 feature. . . thanks for pointing me to that. However, a Fargate option would be great. I'm going to update the request.

matthewcummings avatar Jan 18 '20 16:01 matthewcummings

much needed indeed this caching option for fargate

koxon avatar Jan 24 '20 11:01 koxon

I would like to upvote this feature too. I'm using Fargate at work and our images are ~1GB and it takes very long to start the task because it needs to redownload the image from ECR all the time. If there was some way to cache the image just like the way it's possible for ECS on EC2, then this would be extremely beneficial.

rametta avatar Jan 30 '20 14:01 rametta

How's this evolving?

There are many use cases where what you need is just a Lambda with unrestricted access to a kernel / filesystem. Having Fargate with cached / hot images perfectly fits this use case.

andrestone avatar Feb 17 '20 16:02 andrestone

@jtoberon @samuelkarp I realize that this is a more involved feature to build than it was on ECS with EC2 since the instances are changing underneath across AWS accounts, but are you able to provide any timeline on if and when this image caching would be available in Fargate? Lambda eventually fixed this same cold start issue with the short-term cache. This request is for the direct analog in Fargate.

Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with PULL_BEHAVIOR.

We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the PENDING state before moving to the RUNNING state. ECR reports our container at just under 900MB. Both ECR and the ECS cluster are in the same region, us-east-1.

We have to make some investments in the area soon so I am trying to get a sense for how much we should invest into optimizing our current EC2-based setup because we absolutely want to move to Fargate as soon as this cold start issue is resolved. As always, thank you for your communication.

fitzn avatar Feb 21 '20 14:02 fitzn

I wish Fargate could have some sort of caching. Because of missing environment variables, my task kept failing all weekend, and every restart meant a new image was downloaded from Docker Hub. In the end I faced horrible traffic usage, since Fargate had been deployed within a private VPC. Of course there is an endpoint (Fargate requires both ECR and S3, as I understand it), but some sort of caching would still be a much cheaper and more predictable option.

Brother-Andy avatar Mar 10 '20 14:03 Brother-Andy

@Brother-Andy For this use-case, I built cdk-ecr-sync which syncs specific images from DockerHub to ECR. Doesn't solve the caching part but might reduce your bill.

pgarbe avatar Mar 17 '20 06:03 pgarbe

Ditto on the feature. We use containers to spin-off cyber ranges for students. Usage can fluctuate from 0 to thousands, Fargate is the best solution for ease of management, but the launch time is a challenge even with ECR. Caching is a much-needed feature.

pyx7b avatar Apr 05 '20 04:04 pyx7b

+1

narzero avatar Apr 25 '20 16:04 narzero

+1

klatu201 avatar May 05 '20 05:05 klatu201

Same here. I need to run multiple Fargate tasks cross-region and it takes around a minute to pull the image. Once pulled, the task only takes 4 seconds to run. This completely stops us from using Fargate.

rouralberto avatar May 20 '20 02:05 rouralberto

We had the same problem: the Fargate task should take only 10 seconds to run, but it takes about a minute to pull the image :(

nmqanh avatar May 29 '20 02:05 nmqanh

Is it possible to use an EFS file system to store the image and have the task just run that image? Or is that the same problem, pulling from EFS to the VPS that stores the container?

congthang1 avatar Jun 06 '20 10:06 congthang1

Azure is solving this problem in their platform: https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/

amunhoz avatar Jul 04 '20 19:07 amunhoz

+1 we run a very large number of tasks and 1GB image. This would significantly speed up our deploys and would be a super helpful feature. We're considering moving to EC2 due to Fargate deployment slowness and this is one of the factors.

nakulpathak3 avatar Jul 28 '20 18:07 nakulpathak3

Currently using the GitLab Runner Fargate driver, which is great except for the spin-up time of ~1-2 minutes for our image (> 1 GB), because it has to pull it from ECS for every job. Not super great.

Would really like to see some sort of image caching.

MattBred avatar Aug 05 '20 22:08 MattBred

I have 1 GB containers with no way of reducing their size. They take a very long time to start up on Fargate.

We really need caching features

alicancakil avatar Aug 13 '20 00:08 alicancakil

+1 on this we really need this feature.

SunnyGurnani avatar Sep 23 '20 05:09 SunnyGurnani

The amount of time wasted by this not being a thing is no doubt staggering, and continues to grow as AWS does not address this.

AWS, we could really do with some communication here. I thought that was the point of this repo.

ronkorving avatar Sep 25 '20 01:09 ronkorving

This is one of those weird cases where we are paying for poor performance, bandwidth usage + a 3 min image pull on restart/deploy.

djerraballi avatar Sep 26 '20 00:09 djerraballi

We have work in progress on image pull performance, in particular for images stored in ECR. In the meantime, our metrics and performance testing are showing more consistent image pull performance with platform version 1.4 compared to platform version 1.3, especially looking at p90 and above.

When it comes to image caching specifically, could you expand a little bit on what you would like to see? How would you like to control which images should be cached, for example?

mlanner-aws avatar Sep 28 '20 02:09 mlanner-aws

@mlanner-aws Personally, I just want to see quick bootup times in Fargate (which are currently overshadowed by image pull time). I don't have much desire to control much about that, but that may be different for other people on this thread. I just want it to be fast by default.

ronkorving avatar Sep 28 '20 02:09 ronkorving

TL;DR: the ability to set ECS_IMAGE_PULL_BEHAVIOR to prefer-cached in Fargate. Right now it's effectively `always` by design, a limitation of Fargate that we want a workaround for.

When it comes to image caching specifically, could you expand a little bit on what you would like to see?

@mlanner-aws, the expectation I had in mind was that we essentially get EC2-like caching where there is perhaps some common cache that Fargate tasks already have access to and so when they download an image from ECR or otherwise, they are only downloading the Docker layers that have changed since the previous image.

How would you like to control which images should be cached, for example?

I think any image a task uses (or at least the largest, to begin with) would use the functionality described above, where the pull populates a local cache of the image for future pulls. If the image gets completely invalidated by a very early Docker layer invalidation and takes a long time, that's expected and would be the same in EC2 as well.
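To illustrate the layer-level behaviour being asked for, here is a minimal Python sketch that works out which layers of a new image would still need downloading given a cache of previously pulled layers. The function name and the digests are made up for illustration; a real client compares the digests listed in the image manifest against its local layer store.

```python
def layers_to_pull(image_layers, cached_layers):
    """Return the image's layer digests that are missing from the cache, in order."""
    cached = set(cached_layers)
    return [digest for digest in image_layers if digest not in cached]


# Layers left behind by the previous version of the image (hypothetical digests).
previous = ["sha256:base", "sha256:deps", "sha256:app-v1"]

# Only the final layer changed: a single layer is downloaded.
late_change = layers_to_pull(["sha256:base", "sha256:deps", "sha256:app-v2"], previous)

# An early layer changed and the later layers were rebuilt with new digests:
# nothing in the cache matches, so the whole image is downloaded again.
early_change = layers_to_pull(["sha256:base2", "sha256:deps2", "sha256:app-v3"], previous)
```

The second case is the "early layer invalidation" mentioned above: it is slow with or without a cache, so the benefit of caching comes from the far more common first case.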

As @amunhoz pointed out above, Azure has been able to implement this (https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/).

nakulpathak3 avatar Sep 28 '20 17:09 nakulpathak3