[EKS/Fargate] request: Improve Fargate Node Startup Time

Open lilley2412 opened this issue 4 years ago • 32 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request Improve the startup time of Fargate nodes on EKS

Which service(s) is this request for? EKS Fargate

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I haven't done extensive benchmarks yet, but anecdotal "kicking the tires" of Fargate for EKS shows 30-45 seconds for a node to be created and registered; because it's node-per-pod, I then have to wait for image pull and container start time, so in total it's taking over a minute to start a new pod.

This is problematic for obvious reasons. For some use cases it's not a show-stopper; with HPA-scaled deployments I'm OK with the startup time. For others, like a CI cluster for GitLab, the startup time is painful: each CI job spawns a new pod, which takes "forever".

Are you currently working around this issue? Currently just eating the startup time.

lilley2412 avatar Dec 12 '19 23:12 lilley2412

Same use case. I would love to migrate some GitHub Actions workloads to EKS/Fargate for ease of integration, but the current pod boot time is a showstopper. In our case it's even more exacerbated because we'd use Argo Workflows, which launches a pod per workflow step. GitHub Actions VMs launch incredibly fast; we even had to abandon AWS CodeBuild because of its VM launch time delay, especially on medium and large instances.

unthought avatar Apr 17 '20 13:04 unthought

Same here - we are trying to run Spark jobs on EKS/Fargate. For long-running Spark jobs this is not a big deal, but to streamline our stack we also have many shorter Spark jobs, for which this is effectively unacceptable.

jgoeres avatar Apr 28 '20 12:04 jgoeres

Is this an issue with AWS not having the appropriate EC2 capacity ready and warmed ahead of time for node spooling? Certainly feels like it could be an EC2 initialization happening behind the scenes during each and every pod deploy.

booleanbetrayal avatar Sep 05 '20 18:09 booleanbetrayal

@booleanbetrayal this is not about having EC2 instances up and running beforehand. There is a lot of mechanics happening behind the scenes that adds up (e.g. connecting the ENI to the customer VPC, etc.). Also, Kubernetes doesn't really have a "serverless" mode of operation, so the instance we use needs to be initialized with the Kubernetes client code, it needs to virtually connect to the cluster and show up as a fully initialized worker node, and only then can the pod be scheduled on it. So while we try to make the user experience as "serverless" as possible, the mechanics behind the scenes are more complex than taking a running EC2 instance and deploying the container on it. We do appreciate that for a long-running task this initialization isn't a big deal, but for short-running tasks it adds up. We are working hard to reduce the time it takes for these steps to execute and/or to remove some of these steps (where possible).

mreferre avatar Sep 07 '20 13:09 mreferre

Thanks for the clarification @mreferre !

booleanbetrayal avatar Sep 07 '20 14:09 booleanbetrayal

It seems very inefficient. 1 pod = 1 Fargate node? I was just testing this and it takes 60s in eu-west-1 to find a Fargate node to run a pod on before it starts running. If you are used to normal pod spin-up on k8s, it's a second or less depending on whether the node has the image cached. Would it not be better to implement a virtual kubelet for a node which can then spin up lots of pods and thus run much quicker? Or am I missing something and doing something terribly wrong? I did get caught out by the Fargate private subnets needing a NAT route to get out to the internet to pull images. I'll continue my testing to see if I can get Fargate running faster!
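For anyone else caught out by this, here is a rough sketch (assuming eksctl; the cluster name, region, and subnet IDs are placeholders) of pinning a Fargate profile to private subnets that have a NAT route out for image pulls:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder cluster name
  region: eu-west-1
fargateProfiles:
  - name: default
    selectors:
      - namespace: default  # pods in this namespace get scheduled onto Fargate
    # Fargate pods run in private subnets; without a NAT route (or VPC
    # endpoints for the registry), the image pull out to the internet fails.
    subnets:
      - subnet-0aaaaaaaaaaaaaaaa   # placeholder private subnet IDs
      - subnet-0bbbbbbbbbbbbbbbb
```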

spicysomtam avatar Nov 08 '20 22:11 spicysomtam

@spicysomtam I think 50/60 seconds is just about right and you won't be able to reduce it substantially (it also depends on the image size, but even tiny images won't take less than 45 seconds because of everything above). We are working to reduce this timing over time. If all you need is fast single-pod startup time, then EKS managed node groups are a great solution. Fargate's value isn't in single-pod start time but rather in the additional security that running pods in dedicated OS kernels brings, and the fact that you no longer have to manage/scale/life-cycle nodes (you can read more here). A couple of years ago we did look into the Virtual Kubelet project to run Fargate pods, but 1) this wouldn't have a different effect on pod startup time, given all the Virtual Kubelet does is proxy requests to a backend (Fargate in this case), so the timing experience would have been similar, and 2) Virtual Kubelet is a (heavy) fork of the Kubelet and we did not want to go down that path.

mreferre avatar Nov 09 '20 09:11 mreferre

I did testing with node groups and Cluster Autoscaler (CA) and the pod spin-up times are what I am used to with k8s (most of my experience is with OpenShift and Rancher). The only slight issue, which isn't AWS related, is the spin-up time for new nodes via the CA when pods are stuck in Pending, but you can work around that with placeholder pods at a lower pod priority: k8s kills these and replaces them with your real pending pods, and in the background a new node is spun up.
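For reference, a minimal sketch of that placeholder-pod pattern (the PriorityClass value, names, replica count, and resource sizes are all illustrative; the pause image just reserves capacity):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10               # lower than the default priority (0), so real pods preempt these
globalDefault: false
description: "Placeholder pods that keep spare node capacity warm."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 2            # how much headroom to keep warm
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing, just reserves resources
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```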

spicysomtam avatar Nov 09 '20 09:11 spicysomtam

Sure. If you want to over-index on optimizing pod startup times and you are not constrained by the additional cost of idle resources, that is the right thing to do. With Fargate we aim to remove a lot of the undifferentiated heavy lifting associated with managing the underlying infrastructure but, depending on your use case and objectives, it may not (yet) be the right service for you. We are working on this.

mreferre avatar Nov 09 '20 09:11 mreferre

We are experiencing the same issue with Fargate on ECS... It doesn't seem to be exclusive to EKS.

nyouna avatar Dec 02 '20 22:12 nyouna

@mikestef9 can you please add ECS label?

nyouna avatar Dec 02 '20 22:12 nyouna

Just tried using pod priority to work around this for GitLab runners. Unfortunately, it looks like Fargate overrides all pod priority settings with system-node-critical, so unless I'm missing something, this isn't a viable workaround.

Understandable, the idea of spare/idle capacity in a Serverless env doesn't really make sense in hindsight.

Not blaming Fargate here, still a great tool for other workloads, just hoping to save other folks some time.

nalbury avatar Aug 15 '21 15:08 nalbury

Yes. Pod priorities make sense when multiple pods need to compete for the same node resources (the priority is used to disambiguate which pods are more important than others and hence which should "win"). In the context of Fargate this is not a problem because the "node" is just a "second-class citizen" (or right-sized, dedicated capacity used to fulfill each pod's requests in a 1:1 setup).

mreferre avatar Aug 16 '21 09:08 mreferre

Fargate is a bad solution. The pod-per-vm concept kills the point of containers.

In my opinion the ideal solution would be something closer to cluster-autoscaler, but running in the control plane (so it didn't have to be installed and would work even when there are no nodes) and launching instances directly, so that one didn't have the added complexity of node groups. I think that besides lower user-facing complexity, getting rid of ASGs would probably simplify the implementation logic a great deal too.

almson avatar Sep 30 '21 14:09 almson

I think there is a need for both. To protect against (relatively frequent) container escapes, the notion of a dedicated VM/microVM per pod is gaining A LOT of traction in the industry (especially in the highly regulated segments of the industry). Then, for those situations where this is not required and/or there is a desire for a more classic "cluster of multi-tenant nodes", everything you said makes a lot of sense and there is work being done to address it.

mreferre avatar Oct 02 '21 15:10 mreferre

The description of Fargate makes clear its aim:

AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers. With AWS Fargate, you don't have to provision, configure, or scale groups of virtual machines on your own to run containers.

If what you're interested in is isolation, then containerd-firecracker is a better approach that's agnostic to how your nodes are scaled. Fargate is not a microVM. It's a full VM that runs kubelet, etc, and has more attack surface than a microVM hosting a pod. It's also less efficient, takes longer to launch, etc.

almson avatar Oct 03 '21 09:10 almson

Yes. The Kubelet is part of the VM/microVM and that is what makes the cluster the security boundary as described in the docs. Using a microVM to shield the pod only (leaving the kubelet outside) is an alternative, but the key point here is that it won't solve the problem that you'd need to provision a node on the fly and connect it to the cluster when the user deploys the pod (if we want to stick to the "pay per pod" model).

mreferre avatar Oct 04 '21 07:10 mreferre

Who wants to stick to the pay-per-pod model? A typical user would lose a lot of money on Fargate because each node has to be sized to a pod's resource limit (ie, overprovisioned for peak load), while a multi-tenant node is sized to the sum of the pods' resource requests (which is typically a lot smaller) and each pod can "burst" as much as it wants. You also lose money on the coarse granularity of Fargate resource requests.
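To make the sizing argument concrete, here is a purely illustrative spec (image and numbers are made up): on a multi-tenant node the scheduler packs pods by their requests and lets them burst toward their limits when the node has idle capacity, while on Fargate each pod gets a fixed, pre-provisioned vCPU/memory combination (rounded up to a supported size), so covering the 4 vCPU peak means paying for 4 vCPU for the pod's entire lifetime.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bursty-worker
spec:
  containers:
    - name: app
      image: example.com/worker:latest   # placeholder image
      resources:
        requests:        # what a shared node is packed by: 0.5 vCPU / 1 GiB
          cpu: 500m
          memory: 1Gi
        limits:          # the peak the pod may burst to when the node has slack
          cpu: "4"
          memory: 8Gi
```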

almson avatar Oct 04 '21 11:10 almson

Doesn't Fargate sound more efficient than provisioning your own resources via autoscaler since you can't always predict workloads?

Saeger avatar Oct 04 '21 12:10 Saeger

@Saeger You don't need to predict workloads... you're autoscaling...

almson avatar Oct 04 '21 12:10 almson

@almson it's not just about the pay-per-pod model. Flexible resource usage is an area where standard nodes can help, but people like Fargate because of everything they do not need to think about when deploying a pod (and surely Fargate could do better here). IMO we are all so used to deploying a cluster of EC2 instances to launch pods/tasks that we lose sight of the fact that, to launch EC2 instances, you don't need to manage a rack of physical servers.

mreferre avatar Oct 04 '21 12:10 mreferre

@Saeger You don't need to predict workloads... you're autoscaling...

I think this is easier to say than to accomplish. Autoscaling tends to end up over-provisioning resources. But maybe my workloads face too many corner cases for the discussion here. I see pros and cons in both approaches, tbh.

Saeger avatar Oct 04 '21 12:10 Saeger

I also want to be clear that I am not trying to be dismissive/defensive here. @almson you are bringing a ton of good feedback and perspective to the table. I agree with @Saeger that the world has nuances and there are different needs that need to be tackled with different approaches.

mreferre avatar Oct 04 '21 12:10 mreferre

I've been wanting to use ECS + Fargate for a scale-to-zero web app, but with a 60-90s startup time I'm realizing this approach is probably infeasible.

Edit - Reading around a little more, it seems that I was operating on incorrect assumptions to begin with: #1017

forresthopkinsa avatar Jan 30 '22 05:01 forresthopkinsa

Two and a half years later and this is still an issue. Fargate with EKS could have such great potential, but with the slow scaling capability it's a real bummer.

project0 avatar Jun 11 '22 19:06 project0

That is because Fargate uses an EC2 VM, which is slow, and not a container. AWS is VM-centric. Try GCP GKE; it's much better than EKS.

spicysomtam avatar Jun 11 '22 21:06 spicysomtam

This is not entirely true. AWS developed Firecracker, a microVM hypervisor. Technically nothing is stopping them from starting Fargate pods within seconds. Even EC2 instances start much faster. I can only imagine that some weird or bad service in between delays the scheduling.

project0 avatar Jun 11 '22 23:06 project0

I'm working on a product migration to AWS, and I'm using Fargate. The Docker image is around 5 GB, and it takes more than 3 minutes to get the container running. I don't get anywhere close to the reported 45s (that sounds like a dream).

Is it possible to use EC2-backed containers instead of Fargate and not have to deal with the startup delay?

AffiTheCreator avatar Sep 08 '22 10:09 AffiTheCreator

The 45 seconds is roughly how long it takes to prepare the infrastructure and start the pull of the image. 5 GB is an immense image and it's not surprising that it takes that long. You could reduce that by configuring larger pods (because they will land on larger instances with more CPU/network capacity), but there is a cost associated with that and you probably don't want to spend more than what you need at run-time "just" to speed up the start time. If your start time is very much skewed towards image size, you may want to keep an eye on other work we are doing in this area (e.g. here: https://github.com/aws/containers-roadmap/issues/696).

To answer your question: yes, absolutely, you can use EC2 (typically an EKS managed node group) to deploy your pods. As long as your nodes don't churn too much (scaling in and out), they can cache images, drastically lowering start times.
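If it helps, here is a rough sketch (assuming eksctl; names, instance type, and sizes are placeholders) of that kind of managed node group. With relatively stable nodes, subsequent pods that use the same image start in seconds because the image is already cached on the node:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster       # placeholder
  region: eu-west-1
managedNodeGroups:
  - name: workers        # placeholder node group name
    instanceType: c5.2xlarge
    minSize: 1           # keep at least one warm node so the image stays cached
    maxSize: 5
    desiredCapacity: 2
```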

mreferre avatar Sep 09 '22 10:09 mreferre

The product works with real-time video input, so fast startup time is a must since we need to be able to process the video on demand. Because we work with video, the pod settings are already one of the beefiest configurations AWS offers - 4 vCPU and 8 GB RAM.

I took a look at #696 and I might implement some ideas, thank you for the link.

Regarding the EC2 answer, we have implemented a sort of container recycle cycle within the product, but there is a tradeoff between time and the cost associated with the infrastructure. There is only so long we can keep a container running before it becomes more expensive than launching a new task.
"As long as your nodes don't churn too much (scaling in and scaling out)" - the problem is that that's exactly what the product is supposed to do.

The caching solution would indeed solve our issues - do you have a timeline for when we should expect this feature? I ask because I might need to change the infrastructure if it's going to take a long time.

AffiTheCreator avatar Sep 13 '22 13:09 AffiTheCreator