
ETCD throttling

Open makzzz1986 opened this issue 2 years ago • 6 comments

Summary

Argo Workflows produces a lot of API calls that reach etcd, and some of those requests can be canceled, disrupting Workflow state transitions. You can spot an error on the controller: `cannot validate Workflow: rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests`

Use Cases

There are self-calculated throttling limits on AWS EKS, and if your K8s cluster is small, etcd can throttle some Workflow updates. For example, we have many CronWorkflows, and etcd can drop the update that moves a Workflow from the Running state to Finished; the Workflow then gets stuck forever because the controller keeps waiting for it to finish. The pods of such a Workflow are gone once they complete, so no sidecar containers will try to update the status again.

A retry mechanism should probably be implemented to avoid this and guarantee that the Workflow state transition eventually succeeds.
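As a rough illustration of the kind of retry the proposal asks for, here is a minimal exponential-backoff sketch in Python. This is not Argo Workflows code; `ResourceExhausted` and `update_workflow_status` are stand-ins for the real gRPC error and the controller's status-update call.

```python
import random
import time


class ResourceExhausted(Exception):
    """Stand-in for etcd's throttling error (gRPC code ResourceExhausted)."""


def update_with_backoff(update_fn, max_attempts=5, base_delay=0.1):
    """Call update_fn, retrying with exponential backoff and jitter
    whenever the request is throttled.

    update_fn: any callable performing the API-server write; it should
    raise ResourceExhausted when etcd rejects the request.
    """
    for attempt in range(max_attempts):
        try:
            return update_fn()
        except ResourceExhausted:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Backoff grows as base_delay * 2^attempt, with +/-50% jitter
            # so that many controllers do not retry in lockstep.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)


if __name__ == "__main__":
    # Hypothetical usage: an update that is throttled twice, then succeeds.
    calls = {"n": 0}

    def update_workflow_status():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ResourceExhausted("etcdserver: throttle: too many requests")
        return "Succeeded"

    print(update_with_backoff(update_workflow_status, base_delay=0.01))
```

The jitter matters in practice: if every stuck Workflow retried on the same fixed schedule, the retries themselves would arrive in bursts and keep tripping the same throttle.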

For AWS EKS there is a workaround: scale your cluster up for a short period of time to boost the throttling limit.
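For a managed node group created with eksctl, the temporary scale-up might look like this (cluster and node-group names are placeholders, and the right node count depends on your workload):

```shell
# Temporarily grow the node group so EKS raises the dynamic API limits
eksctl scale nodegroup --cluster=my-cluster --name=my-nodegroup --nodes=10

# ...let the backlog of Workflow updates drain, then scale back down
eksctl scale nodegroup --cluster=my-cluster --name=my-nodegroup --nodes=3
```

Note the caveat raised later in this thread: because the limits are recalculated dynamically, scaling back down also lowers them again, so this only buys time.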


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

makzzz1986 avatar Oct 10 '22 08:10 makzzz1986

Unfortunately, AWS EKS cluster API limits change dynamically: if your cluster has shrunk, the limits will be decreased as well.

makzzz1986 avatar Oct 17 '22 09:10 makzzz1986

Experienced a similar situation on small EKS clusters (3 EC2 nodes) that use Fargate exclusively to run hundreds of CronWorkflows simultaneously. Apparently, launching too many new nodes at once can overload the EKS control plane and make it start throwing the too many requests error. Switching the CronWorkflows to run on EC2 instances seems to mitigate the issue (at least for us).

mcntrn avatar Dec 03 '22 01:12 mcntrn

@makzzz1986 We are also facing this issue. Did you get confirmation from AWS about the dynamic API limit resizing and what those thresholds are? What is taken into account when calculating those limits: number of nodes, size of nodes, or both?

watkinsmike avatar Apr 05 '23 16:04 watkinsmike

@watkinsmike I've got confirmation from AWS that it could be an issue for small clusters. They also suspect that Argo Workflows issues multiple simultaneous requests to the same resource, which causes the throttling. They have the ability to tweak the limit, and we are checking how it behaves. I suggest you open a support case about it and share its number; I will try to pass it on to the AWS team that owns our case.

makzzz1986 avatar Apr 05 '23 19:04 makzzz1986

We have been seeing the same issue on a small cluster: 15 m5.2xlarge nodes (8 cores, 32 GB RAM each), running 10 workflows that each use 1 core and 512 MB RAM. 8 of those workflows failed with etcd throttling 🤷

devjerry0 avatar Nov 23 '23 14:11 devjerry0

Did you solve this?

tooptoop4 avatar May 15 '24 11:05 tooptoop4