argo-rollouts
High CPU usage with many old analysisRun objects
Hello 👋 !
We've come across an interesting performance/scalability issue: the argo-rollouts controller uses quite a bit of CPU when there are many old (non-active) analysisRun objects in the cluster. Deleting them fixes the performance issue completely.
The performance issue wasn't too much of a problem to be honest, but the resource usage is surprising (especially for "inactive" objects).
Checklist:
- [X] I've included steps to reproduce the bug.
- [X] I've included the version of Argo Rollouts.
Describe the bug
With 6600 analysisRun objects (and 0 active), the controller was using an average of about ~1.5 cores and spiked to ~15, whereas after deleting the old ones we were back to an average of 0.2 cores with spikes at ~1.7 cores. See the attached screenshot.
I haven't had a chance to look at the controller code yet, but I suspect the controller still lists them and publishes metrics/does some kind of processing that is quite costly (see the sketch below).
Screenshot
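For intuition on why inactive objects could still cost CPU, here is a minimal, hypothetical sketch of the usual client-go informer pattern — not the actual argo-rollouts controller code; the 30s resync period and the handler body are illustrative assumptions. With a periodic resync, every cached AnalysisRun, terminal or not, is re-delivered to the event handler, so per-resync work grows linearly with the total number of objects in the cluster.

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	analysisRuns := schema.GroupVersionResource{
		Group:    "argoproj.io",
		Version:  "v1alpha1",
		Resource: "analysisruns",
	}

	// With a 30s resync, every cached AnalysisRun (terminal or not) is
	// periodically re-delivered to UpdateFunc, so the work done here is
	// proportional to the total object count, not just the active runs.
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Second)
	informer := factory.ForResource(analysisRuns).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			run := newObj.(*unstructured.Unstructured)
			_ = run.GetName() // stand-in for per-object reconcile/metrics work
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```

If the real controller follows a similar pattern for resyncs and controller metrics, that would line up with CPU scaling on the total AnalysisRun count rather than on the active ones.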
To Reproduce
Create many analysisRun objects and look at the CPU usage
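As a concrete (hypothetical) way to do that in bulk, the sketch below creates a few thousand AnalysisRuns that complete almost immediately and then sit in the cluster as inactive objects. The name prefix, namespace, Prometheus address, and the trivial vector(0) query are placeholder assumptions, not taken from our setup.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Uses the local kubeconfig; run this against a test cluster, not production.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	analysisRuns := schema.GroupVersionResource{
		Group:    "argoproj.io",
		Version:  "v1alpha1",
		Resource: "analysisruns",
	}

	// A trivially-completing AnalysisRun: a single measurement of a constant
	// query with no failure condition, so it finishes quickly and then remains
	// in the cluster as an old, inactive object.
	run := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "argoproj.io/v1alpha1",
		"kind":       "AnalysisRun",
		"metadata":   map[string]interface{}{"generateName": "cpu-repro-"},
		"spec": map[string]interface{}{
			"metrics": []interface{}{map[string]interface{}{
				"name":  "noop",
				"count": int64(1),
				"provider": map[string]interface{}{
					"prometheus": map[string]interface{}{
						"address": "http://prometheus.example.svc.cluster.local",
						"query":   "vector(0)",
					},
				},
			}},
		},
	}}

	// Create a few thousand runs, then compare the controller's CPU usage
	// against a cluster without the extra objects.
	for i := 0; i < 6000; i++ {
		if _, err := client.Resource(analysisRuns).Namespace("default").Create(context.TODO(), run, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
	fmt.Println("created 6000 AnalysisRuns")
}
```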
Expected behavior
The CPU usage of the controller should not increase drastically with the number of old, inactive analysisRun objects.
Version
Argo Rollouts version: v1.1.1
Definition
We're using some relatively simple ClusterAnalysisTemplates, such as:
Name:         server-rpc-error-rate
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         ClusterAnalysisTemplate
Metadata:
  Creation Timestamp:  2022-03-09T14:04:33Z
  Generation:          1
  Resource Version:    9762774944
  Self Link:           /apis/argoproj.io/v1alpha1/clusteranalysistemplates/server-rpc-error-rate
  UID:                 d8b5886a-9fb1-11ec-9de9-02cbfa055fbb
Spec:
  Args:
    Name:  service-name
  Metrics:
    Count:              3
    Failure Condition:  result[0] >= 10
    Initial Delay:      70s
    Interval:           75s
    Name:               server-rpc-error-rate
    Provider:
      Prometheus:
        Address:  http://prometheus-services.monitoring.svc.cluster.local
        Query:    service:server_rpc_errors_5xx_percent:rate1m{service="{{args.service-name}}", phase="canary"} or on() vector(0)
with a very simple canary setup:
Spec:
  Replicas:  10
  Selector:
    Match Labels:
      monzo.com/routing-name:  service.foo
  Strategy:
    Canary:
      Analysis:
        Args:
          Name:   service-name
          Value:  service.user-context
        Templates:
          Cluster Scope:  true
          Template Name:  server-rpc-error-rate
          Cluster Scope:  true
          Template Name:  rule-2
          Cluster Scope:  true
          Template Name:  rule-3
          Cluster Scope:  true
          Template Name:  rule-4
      Canary Metadata:
        Labels:
          Phase:  canary
      Stable Metadata:
        Labels:
          Phase:  stable
      Steps:
        Set Weight:  100
        Pause:
          Duration:  5m
  Workload Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         s-foo
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
Did a single rollout have 6600 runs, or was that the total number of runs within the namespace from many different rollouts?
Also, just to inform you in case you were not aware, you can control the number of runs to keep via:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout-canary
spec:
  # Number of desired pods.
  # Defaults to 1.
  replicas: 5
  analysis:
    # limits the number of successful analysis runs and experiments to be stored in a history
    # Defaults to 5.
    successfulRunHistoryLimit: 10
    # limits the number of unsuccessful analysis runs and experiments to be stored in a history.
    # Stages for unsuccessful: "Error", "Failed", "Inconclusive"
    # Defaults to 5.
    unsuccessfulRunHistoryLimit: 10
This of course does not help with the performance in the case where there are multiple rollouts, so I need to look into that a bit more just to make sure Rollouts is being as efficient as it can be.
Great question! We're aware of those settings (which we will decrease in the meantime, because they are of relatively low value to us).
This is a total across all Rollouts. We had so many analysisRun objects around because we have 2220 services and rollouts running in our cluster, and I believe that's why we're seeing this kind of niche performance issue.
More than happy to have a look at it if you have some pointers :)
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.