argo-cd
Application Controller no longer publishes workqueue_depth metrics
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
Describe the bug
The Application Controller no longer publishes `workqueue_depth` metrics. (The ApplicationSet Controller still does, as expected, and other metrics are produced as expected.)
To Reproduce
Install Argo CD 2.9 into the `argocd` namespace using the official Helm chart with metrics collection enabled.
Expected behavior
`workqueue_depth` metrics should be published for the Application Controller.
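For anyone else checking whether they're affected: a quick way is to port-forward the Application Controller's metrics port and look for `workqueue_*` families in the Prometheus exposition output. A minimal sketch (the sample text, the port-forward command, and the 8082 port are my assumptions, not taken from the chart):

```python
import re


def workqueue_families(exposition: str) -> set[str]:
    """Return the names of workqueue_* metric families in Prometheus text format."""
    families = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: workqueue_depth{name="app_reconciliation"} 0
        match = re.match(r"(workqueue_[a-z_]+)", line)
        if match:
            families.add(match.group(1))
    return families


# Demo on canned exposition text; against a real endpoint you would fetch, e.g.:
#   kubectl -n argocd port-forward statefulset/argocd-application-controller 8082:8082
#   body = urllib.request.urlopen("http://localhost:8082/metrics").read().decode()
sample = (
    '# HELP workqueue_depth Current depth of workqueue\n'
    'workqueue_depth{name="q"} 0\n'
    'workqueue_adds_total{name="q"} 5\n'
    'argocd_app_info{sync_status="Synced"} 1\n'
)
print(sorted(workqueue_families(sample)))
# → ['workqueue_adds_total', 'workqueue_depth']
```

On an affected 2.9 controller, the real endpoint returns no `workqueue_*` families at all, which makes the regression easy to spot.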
Screenshots
Version
```shell
argocd@argocd-server-99d565b8-4wh7w:~$ argocd version
argocd: v2.9.0+9cf0c69
  BuildDate: 2023-11-06T04:43:50Z
  GitCommit: 9cf0c69bbe70393db40e5755e34715f30179ee09
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
FATA[0000] Argo CD server address unspecified
argocd@argocd-server-99d565b8-4wh7w:~$
```
Logs
n/a
In case it is useful, here are my settings from values.yaml (our Helm chart depends on the Argo CD Helm chart, hence the `argo-cd` top-level key):
```yaml
argo-cd:
  redis-ha:
    enabled: true
    image:
      repository: public.ecr.aws/docker/library/redis
    exporter:
      enabled: true
      serviceMonitor:
        enabled: true
    podDisruptionBudget:
      maxUnavailable: 1
    haproxy:
      resources:
        requests:
          memory: '130Mi'
          cpu: '0.06'
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
  controller:
    replicas: 1
    # We need increased resources in the Application Controller to handle our high number of applications.
    resources:
      requests:
        memory: '15Gi'
        cpu: '10'
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
    # We don't want ArgoCD to be allowed to delete namespaces in any environment.
    clusterRoleRules:
      enabled: true
      rules: # REDACTED
  server:
    autoscaling:
      enabled: true
      minReplicas: 3
    resources:
      requests:
        memory: '2Gi'
        cpu: '1.2'
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
    certificate: # REDACTED
    service:
      # We use a LoadBalancer service here instead of ingress because our ingress controller
      # is deployed via ArgoCD. We don't want a problem with ingress to make ArgoCD unavailable.
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    volumes:
      - name: argocd-custom-js
        configMap:
          name: argocd-custom-js
    volumeMounts:
      - name: argocd-custom-js
        mountPath: /tmp/extensions/argocd-custom-js/
  repoServer:
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 12
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
    env:
      - name: ARGOCD_EXEC_TIMEOUT
        value: 7m
      - name: ARGOCD_GRPC_MAX_SIZE_MB
        value: '250'
    resources:
      requests:
        memory: '2330Mi'
        cpu: '1'
        ephemeral-storage: '10Gi'
    volumes:
      - name: gitea-tls-volume
        secret:
          secretName: gitea-tls
    volumeMounts:
      - name: gitea-tls-volume
        mountPath: /etc/ssl/certs/gitea-tls.pem
        subPath: ca.crt
  applicationSet:
    replicaCount: 2
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
  configs:
    styles: # REDACTED
    cm:
      accounts.readonly: apiKey
      accounts.applications-readonly: apiKey
      admin.enabled: false
      # Require sign-in every morning to avoid session expirations in the middle of the day.
      # A late sign-in at 5pm will expire at 7am the next day.
      users.session.duration: 14h
      # Reconcile git and Kubernetes every 10m (default 3m).
      timeout.reconciliation: 10m
      # Regenerate helm every hour.
      timeout.hard.reconciliation: 1h
      # Track resources that belong to an application with an annotation instead of a label
      # (argocd.argoproj.io/tracking-id).
      application.resourceTrackingMethod: annotation
      # Configures project inheritance.
      globalProjects: |-
        - labelSelector:
            matchExpressions:
              - key: parent-project
                operator: In
                values:
                  - global-project
          projectName: global-project
      # Ignore aggregated clusterroles when diffing: https://github.com/argoproj/argo-cd/pull/3076
      resource.compareoptions: |
        ignoreAggregatedRoles: true
      dex.config: # REDACTED
    params:
      otlp.address: otel-agent-opentelemetry-collector-agent.honeycomb:4317
      # 50 status and 25 operation is the benchmark for 1000 apps.
      controller.status.processors: 100
      controller.operation.processors: 50
      controller.repo.server.timeout.seconds: 270
      controller.self.heal.timeout.seconds: 60
    rbac:
      policy.default: role:safe-readonly
      policy.csv: # REDACTED
```
I have the same problem. It's not only `workqueue_depth`; I cannot get any of the `workqueue_*` metrics, such as `workqueue_adds` and `workqueue_longest_running_processor_seconds`.

Argo CD: v2.9.0, Helm chart: v5.51.0
The problem still exists in Argo CD v2.9.2.
Still broken in v2.9.3
I believe these metrics get registered here: https://github.com/argoproj/argo-cd/blob/master/controller/metrics/metrics.go#L163

As far as I can tell, nothing has really changed there between 2.8 and 2.9 beyond the new alpha sharding feature.
We're also affected by this. Could this be the cause: https://github.com/argoproj/argo-cd/pull/15480? It was released in 2.9.0 and it touches that area.
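If that PR changed initialization order, one plausible failure mode (speculation on my part, not confirmed against the Argo CD source) is the classic client-go pitfall: workqueues snapshot the global metrics provider at construction time, so any queue created before the Prometheus provider is registered silently gets no-op metrics. A toy sketch of that ordering bug, in Python purely for illustration (the real code is Go; all names here are made up, not Argo CD's):

```python
class NoopMetrics:
    """Default provider: every observation is silently dropped."""
    def __init__(self):
        self.samples = []

    def observe_depth(self, depth: int) -> None:
        pass


class PrometheusMetrics:
    """Real provider: records workqueue_depth samples."""
    def __init__(self):
        self.samples = []

    def observe_depth(self, depth: int) -> None:
        self.samples.append(("workqueue_depth", depth))


# Module-level provider, mirroring how client-go's workqueue package holds a
# global MetricsProvider that must be set before any queue is constructed.
_provider = NoopMetrics()


def set_provider(provider) -> None:
    global _provider
    _provider = provider


class WorkQueue:
    def __init__(self):
        # The queue binds its metrics object once, at construction.
        self.metrics = _provider
        self.items = []

    def add(self, item) -> None:
        self.items.append(item)
        self.metrics.observe_depth(len(self.items))


# Buggy order: queue constructed first, provider registered second.
queue = WorkQueue()
prom = PrometheusMetrics()
set_provider(prom)
queue.add("app-1")
print(prom.samples)  # → [] : the queue kept the no-op metrics

# Correct order: register the provider first, then construct queues.
set_provider(PrometheusMetrics())
queue2 = WorkQueue()
queue2.add("app-1")
print(queue2.metrics.samples)  # → [('workqueue_depth', 1)]
```

If something like this is what happened, the metrics endpoint would still serve fine and every other family would appear, with only the `workqueue_*` families missing, which matches the symptoms reported above.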
Related to https://github.com/argoproj/argo-cd/issues/12241 and https://github.com/argoproj/argo-cd/pull/8318