ETCD throttling
Summary
Argo-Workflows produces a lot of API calls to ETCD; some of those requests can be canceled, and Workflow transitions between states can be disrupted. You can spot this error on the controller:
cannot validate Workflow: rpc error: code = ResourceExhausted desc = etcdserver: throttle: too many requests
Use Cases
AWS EKS calculates its own throttling limits, and if your K8s cluster is small, ETCD can throttle some Workflow updates. For example, we have many CronWorkflows, and ETCD can drop the update that moves a Workflow from the Running state to Finished; the Workflow then gets stuck forever because the controller keeps waiting for it to finish. The Pods of such a Workflow are gone once they are completed, and no sidecar containers will try to update the status again.
Probably, some retry mechanism should be implemented to avoid this and guarantee that the Workflow state transition eventually happens.
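A minimal sketch of what such a retry could look like, assuming the update goes through the generated Argo Workflows clientset; the updateWithBackoff helper, the backoff values, and the choice of retriable error checks are illustrative assumptions, not the actual controller code:

```go
package wfretry

import (
	"context"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

// updateWithBackoff retries a Workflow update when the API server reports a
// throttling-style failure, instead of silently dropping the state transition.
func updateWithBackoff(ctx context.Context, clients wfclientset.Interface, wf *wfv1.Workflow) error {
	backoff := wait.Backoff{Steps: 5, Duration: 200 * time.Millisecond, Factor: 2.0, Jitter: 0.1}
	retriable := func(err error) bool {
		// Assumption: "etcdserver: throttle: too many requests" surfaces as a
		// 429 / server-timeout style error from the API server.
		return apierrors.IsTooManyRequests(err) || apierrors.IsServerTimeout(err)
	}
	return retry.OnError(backoff, retriable, func() error {
		_, err := clients.ArgoprojV1alpha1().Workflows(wf.Namespace).Update(ctx, wf, metav1.UpdateOptions{})
		if err != nil {
			log.Printf("workflow %s/%s update failed, retrying: %v", wf.Namespace, wf.Name, err)
		}
		return err
	})
}
```

Whether this would live in the controller's status updater or in its requeue path is left open; the point is only that a throttled update gets retried rather than lost.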
For AWS EKS there is a workaround: scale your cluster up for a short period of time to boost the throttling limit.
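As a rough illustration of that workaround (assuming a managed node group; the cluster and node group names are placeholders), the node group's desired size could be bumped temporarily via the AWS SDK for Go:

```go
package wfretry

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/eks"
	"github.com/aws/aws-sdk-go-v2/service/eks/types"
)

// scaleNodegroup raises the desired size of a managed node group so that EKS
// recalculates its (larger) control-plane throttling limits.
func scaleNodegroup(ctx context.Context, cluster, nodegroup string, desired int32) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := eks.NewFromConfig(cfg)
	_, err = client.UpdateNodegroupConfig(ctx, &eks.UpdateNodegroupConfigInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
		ScalingConfig: &types.NodegroupScalingConfig{
			DesiredSize: aws.Int32(desired),
			MaxSize:     aws.Int32(desired), // keep max >= desired
		},
	})
	if err != nil {
		return err
	}
	log.Printf("requested scale of %s/%s to %d nodes", cluster, nodegroup, desired)
	return nil
}
```

Note that scaling back down afterwards shrinks the limits again, per the comment below.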
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.
Unfortunately, AWS EKS cluster API limits are changed dynamically: if your cluster has shrunk, the limits will be decreased as well.
Experienced a similar situation on small EKS clusters (3 EC2 nodes) that exclusively run hundreds of CronWorkflows simultaneously on Fargate. Apparently, if you launch too many new nodes simultaneously, the EKS control plane can become overloaded and start throwing the too many requests error. Switching the CronWorkflows to run on EC2 instances seems to mitigate the issue (at least for us).
@makzzz1986 We are also facing this issue. Did you get any confirmation from AWS about the dynamic API limit resizing and what those thresholds are? What is taken into account to determine those limits: number of nodes, size of nodes, or both?
@watkinsmike I got confirmation from AWS that it can be an issue for small clusters. They also suspect that Argo-Workflows creates multiple requests to the same resource at the same moment, which causes throttling. They are able to tweak the limit, and we are checking how it behaves. I suggest you open a support case about it and share its number; I will try to pass it to the AWS team that owns this case.
We have been seeing the same issue on a small cluster: 15 m5.2xlarge nodes (8 cores, 32 GB RAM each) running 10 workflows, each using 1 core and 512 MB RAM. 8 of those failed with etcd throttling 🤷
Did you solve it?