
High CPU usage with many old analysisRun objects

Open pallamidessi opened this issue 2 years ago • 2 comments

Hello 👋 !

We've come across an interesting performance/scalability issue: the argo-rollouts controller uses quite a bit of CPU when there are many old (non-active) analysisRun objects in the cluster. Deleting them fixes the performance issue completely.

The performance issue wasn't too much of a problem, to be honest, but the resource usage is surprising (especially for "inactive" objects).
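
For reference, here is roughly how we cleared the backlog. This is a sketch, not an official cleanup procedure: it requires jq, filters on status.phase client-side (field selectors don't cover custom resource status), and the terminal phase names are the ones we observe in our cluster:

# Delete all AnalysisRuns that have reached a terminal phase.
kubectl get analysisruns --all-namespaces -o json \
  | jq -r '.items[]
      | select(.status.phase == "Successful" or .status.phase == "Failed"
            or .status.phase == "Error" or .status.phase == "Inconclusive")
      | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns name; do
      kubectl delete analysisrun -n "$ns" "$name"
    done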

Checklist:

  • [X] I've included steps to reproduce the bug.
  • [X] I've included the version of argo rollouts.

Describe the bug

With 6600 analysisRuns (and 0 active), the controller was using an average of ~1.5 cores and spiked to ~15 cores, whereas after deleting the old ones we were back to an average of 0.2 cores with spikes at ~1.7 cores. See attached screenshot.

I have yet to look at the controller code, but I suspect the controller still lists them and publishes metrics / does some kind of processing that is quite costly.

(Screenshot: controller CPU usage, 2022-07-27 at 16:04:36)
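
A quick way to see how large the inactive backlog is (a sketch assuming jq is available; anything not Pending/Running is inactive):

# Count AnalysisRuns by phase across the whole cluster.
kubectl get analysisruns --all-namespaces -o json \
  | jq -r '.items[].status.phase' | sort | uniq -c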

To Reproduce

Create many analysisRun objects and look at the CPU usage
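
For example, here is a minimal sketch to generate a backlog of completed runs. The names, namespace, and the no-op Prometheus query are made up for illustration; spec.terminate asks the controller to end each run immediately:

#!/usr/bin/env bash
# Create many AnalysisRun objects that terminate right away, leaving the
# controller watching a large set of inactive runs.
for i in $(seq 1 6600); do
  cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: repro-run-${i}
  namespace: default
spec:
  terminate: true
  metrics:
  - name: noop
    provider:
      prometheus:
        address: http://prometheus-services.monitoring.svc.cluster.local
        query: vector(0)
EOF
done

Then watch the controller's CPU, e.g. with kubectl top pod in the namespace where the controller runs (requires metrics-server).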

Expected behavior

The CPU usage of the controller should not increase drastically with the number of old, inactive analysisRun objects.

Version

Argo Rollouts version: v1.1.1

Definition

We're using some relatively simple ClusterAnalysisTemplates, such as:

Name:         server-rpc-error-rate
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         ClusterAnalysisTemplate
Metadata:
  Creation Timestamp:  2022-03-09T14:04:33Z
  Generation:          1
  Resource Version:    9762774944
  Self Link:           /apis/argoproj.io/v1alpha1/clusteranalysistemplates/server-rpc-error-rate
  UID:                 d8b5886a-9fb1-11ec-9de9-02cbfa055fbb
Spec:
  Args:
    Name:  service-name
  Metrics:
    Count:              3
    Failure Condition:  result[0] >= 10
    Initial Delay:      70s
    Interval:           75s
    Name:               server-rpc-error-rate
    Provider:
      Prometheus:
        Address:  http://prometheus-services.monitoring.svc.cluster.local
        Query:    service:server_rpc_errors_5xx_percent:rate1m{service="{{args.service-name}}", phase="canary"} or on() vector(0)

with a very simple canary setup:

Spec:
  Replicas:  10
  Selector:
    Match Labels:
      monzo.com/routing-name:  service.foo
  Strategy:
    Canary:
      Analysis:
        Args:
          Name:   service-name
          Value:  service.user-context
        Templates:
          Cluster Scope:  true
          Template Name:  server-rpc-error-rate
          Cluster Scope:  true
          Template Name:  rule-2
          Cluster Scope:  true
          Template Name:  rule-3
          Cluster Scope:  true
          Template Name:  rule-4
      Canary Metadata:
        Labels:
          Phase:  canary
      Stable Metadata:
        Labels:
          Phase:  stable
      Steps:
        Set Weight:  100
        Pause:
          Duration:  5m
  Workload Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         s-foo

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

pallamidessi avatar Jul 27 '22 15:07 pallamidessi

Did a single rollout have 6600 runs, or was that the total number of runs within the namespace from many different rollouts?

Also, just to inform you in case you were not aware: you can control the number of runs to keep via

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout-canary
spec:
  # Number of desired pods.
  # Defaults to 1.
  replicas: 5
  analysis:
    # limits the number of successful analysis runs and experiments to be stored in a history
    # Defaults to 5.
    successfulRunHistoryLimit: 10
    # limits the number of unsuccessful analysis runs and experiments to be stored in a history. 
    # Stages for unsuccessful: "Error", "Failed", "Inconclusive"
    # Defaults to 5.
    unsuccessfulRunHistoryLimit: 10

This of course does not help with the performance case where there are multiple rollouts, so I need to look into that a bit more just to make sure rollouts is being as efficient as it can be.

zachaller avatar Jul 27 '22 18:07 zachaller

Great question! We're aware of those settings (which we will decrease in the meantime, because the run history is of relatively low value for us).
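
Concretely, we're planning to set them very low, along these lines (a sketch; we're assuming a limit of 0 is honored as "keep none", which we haven't verified):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout-canary
spec:
  analysis:
    # Assumption: 0 keeps no successful runs around after completion.
    successfulRunHistoryLimit: 0
    # Assumption: 0 keeps no failed/errored/inconclusive runs either.
    unsuccessfulRunHistoryLimit: 0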

This is a total across all Rollouts. We had so many analysisRun objects around because we have 2220 services and rollouts running in our cluster, and I believe that's why we're seeing this kind of niche performance issue.
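
(For scale: with the default limits of 5 successful and 5 unsuccessful runs per rollout, ~2220 rollouts can legitimately retain on the order of 2220 × 10 ≈ 22,000 runs cluster-wide, so per-rollout limits alone only shrink the constant.)

To confirm the runs were spread across many rollouts rather than concentrated in one, we grouped them by owner (a sketch assuming jq and that each run carries an ownerReference to its Rollout):

# Count AnalysisRuns per owning Rollout, largest first.
kubectl get analysisruns --all-namespaces -o json \
  | jq -r '.items[] | .metadata.ownerReferences[]?
      | select(.kind == "Rollout") | .name' \
  | sort | uniq -c | sort -rn | head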

More than happy to have a look at it if you have some pointers :)

pallamidessi avatar Jul 28 '22 12:07 pallamidessi

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 16 '22 04:10 github-actions[bot]
