argo-cd
Application Controller no longer publishes workqueue_depth metrics
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
Describe the bug
The Application Controller no longer publishes `workqueue_depth` metrics. (The ApplicationSet Controller still does, as expected, and other metrics are produced as expected.)
To Reproduce
Install Argo CD 2.9 into the `argocd` namespace using the official Helm chart with metrics collection enabled.
Expected behavior
`workqueue_depth` metrics should be published for the Application Controller.
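For anyone else checking whether they're affected: a quick way is to port-forward the Application Controller's metrics port and look for `workqueue_*` families in the Prometheus exposition output. A minimal sketch (the sample text, the port-forward command, and the 8082 port are my assumptions, not taken from the chart):

```python
import re


def workqueue_families(exposition: str) -> set[str]:
    """Return the names of workqueue_* metric families in Prometheus text format."""
    families = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: workqueue_depth{name="app_reconciliation"} 0
        match = re.match(r"(workqueue_[a-z_]+)", line)
        if match:
            families.add(match.group(1))
    return families


# Demo on canned exposition text; against a real endpoint you would fetch, e.g.:
#   kubectl -n argocd port-forward statefulset/argocd-application-controller 8082:8082
#   body = urllib.request.urlopen("http://localhost:8082/metrics").read().decode()
sample = (
    '# HELP workqueue_depth Current depth of workqueue\n'
    'workqueue_depth{name="q"} 0\n'
    'workqueue_adds_total{name="q"} 5\n'
    'argocd_app_info{sync_status="Synced"} 1\n'
)
print(sorted(workqueue_families(sample)))
# → ['workqueue_adds_total', 'workqueue_depth']
```

On an affected 2.9 controller, the real endpoint returns no `workqueue_*` families at all, which makes the regression easy to spot.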
Screenshots
Version
```shell
argocd@argocd-server-99d565b8-4wh7w:~$ argocd version
argocd: v2.9.0+9cf0c69
  BuildDate: 2023-11-06T04:43:50Z
  GitCommit: 9cf0c69bbe70393db40e5755e34715f30179ee09
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
FATA[0000] Argo CD server address unspecified
argocd@argocd-server-99d565b8-4wh7w:~$
```
Logs
n/a
In case it is useful, here are my settings from values.yaml (our Helm chart depends on the Argo CD Helm chart, hence the `argo-cd` top-level key):
```yaml
argo-cd:
  redis-ha:
    enabled: true
    image:
      repository: public.ecr.aws/docker/library/redis
    exporter:
      enabled: true
      serviceMonitor:
        enabled: true
    podDisruptionBudget:
      maxUnavailable: 1
    haproxy:
      resources:
        requests:
          memory: '130Mi'
          cpu: '0.06'
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
  controller:
    replicas: 1
    # We need increased resources in the Application Controller to handle our high number of applications.
    resources:
      requests:
        memory: '15Gi'
        cpu: '10'
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
    # We don't want ArgoCD to be allowed to delete namespaces in any environment.
    clusterRoleRules:
      enabled: true
      rules: # REDACTED
  server:
    autoscaling:
      enabled: true
      minReplicas: 3
    resources:
      requests:
        memory: '2Gi'
        cpu: '1.2'
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
    certificate: # REDACTED
    service:
      # We use a LoadBalancer service here instead of ingress because our ingress controller
      # is deployed via ArgoCD. We don't want a problem with ingress to make ArgoCD unavailable.
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    volumes:
      - name: argocd-custom-js
        configMap:
          name: argocd-custom-js
    volumeMounts:
      - name: argocd-custom-js
        mountPath: /tmp/extensions/argocd-custom-js/
  repoServer:
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 12
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
    env:
      - name: ARGOCD_EXEC_TIMEOUT
        value: 7m
      - name: ARGOCD_GRPC_MAX_SIZE_MB
        value: '250'
    resources:
      requests:
        memory: '2330Mi'
        cpu: '1'
        ephemeral-storage: '10Gi'
    volumes:
      - name: gitea-tls-volume
        secret:
          secretName: gitea-tls
    volumeMounts:
      - name: gitea-tls-volume
        mountPath: /etc/ssl/certs/gitea-tls.pem
        subPath: ca.crt
  applicationSet:
    replicaCount: 2
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
    pdb:
      enabled: true
      maxUnavailable: '1'
  configs:
    styles: # REDACTED
    cm:
      accounts.readonly: apiKey
      accounts.applications-readonly: apiKey
      admin.enabled: false
      # Require sign-in every morning to avoid session expirations in the middle of the day.
      # A late sign-in at 5pm will expire at 7am the next day.
      users.session.duration: 14h
      # Reconcile git and Kubernetes every 10m (default 3m).
      timeout.reconciliation: 10m
      # Regenerate helm every hour.
      timeout.hard.reconciliation: 1h
      # Track resources that belong to an application with an annotation instead of a label
      # (argocd.argoproj.io/tracking-id).
      application.resourceTrackingMethod: annotation
      # Configures project inheritance.
      globalProjects: |-
        - labelSelector:
            matchExpressions:
              - key: parent-project
                operator: In
                values:
                  - global-project
          projectName: global-project
      # Ignore aggregated clusterroles when diffing: https://github.com/argoproj/argo-cd/pull/3076
      resource.compareoptions: |
        ignoreAggregatedRoles: true
      dex.config: # REDACTED
    params:
      otlp.address: otel-agent-opentelemetry-collector-agent.honeycomb:4317
      # 50 status and 25 operation is the benchmark for 1000 apps.
      controller.status.processors: 100
      controller.operation.processors: 50
      controller.repo.server.timeout.seconds: 270
      controller.self.heal.timeout.seconds: 60
    rbac:
      policy.default: role:safe-readonly
      policy.csv: # REDACTED
```
I have the same problem. It's not only `workqueue_depth`; I cannot get any of the `workqueue_*` metrics, such as `workqueue_adds` and `workqueue_longest_running_processor_seconds`.

Argo CD: v2.9.0, Helm chart: v5.51.0
The problem still exists in Argo CD v2.9.2.
Still broken in v2.9.3
I believe these metrics get registered here: https://github.com/argoproj/argo-cd/blob/master/controller/metrics/metrics.go#L163

As far as I can tell, nothing has really changed there between 2.8 and 2.9 beyond the new alpha sharding feature.
We're also affected by this. Could this be the cause: https://github.com/argoproj/argo-cd/pull/15480? It was released in 2.9.0 and it touches that area.
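If that PR changed initialization order, one plausible failure mode (speculation on my part, not confirmed against the Argo CD source) is the classic client-go pitfall: workqueues snapshot the global metrics provider at construction time, so any queue created before the Prometheus provider is registered silently gets no-op metrics. A toy sketch of that ordering bug, in Python purely for illustration (the real code is Go; all names here are made up, not Argo CD's):

```python
class NoopMetrics:
    """Default provider: every observation is silently dropped."""
    def __init__(self):
        self.samples = []

    def observe_depth(self, depth: int) -> None:
        pass


class PrometheusMetrics:
    """Real provider: records workqueue_depth samples."""
    def __init__(self):
        self.samples = []

    def observe_depth(self, depth: int) -> None:
        self.samples.append(("workqueue_depth", depth))


# Module-level provider, mirroring how client-go's workqueue package holds a
# global MetricsProvider that must be set before any queue is constructed.
_provider = NoopMetrics()


def set_provider(provider) -> None:
    global _provider
    _provider = provider


class WorkQueue:
    def __init__(self):
        # The queue binds its metrics object once, at construction.
        self.metrics = _provider
        self.items = []

    def add(self, item) -> None:
        self.items.append(item)
        self.metrics.observe_depth(len(self.items))


# Buggy order: queue constructed first, provider registered second.
queue = WorkQueue()
prom = PrometheusMetrics()
set_provider(prom)
queue.add("app-1")
print(prom.samples)  # → [] : the queue kept the no-op metrics

# Correct order: register the provider first, then construct queues.
set_provider(PrometheusMetrics())
queue2 = WorkQueue()
queue2.add("app-1")
print(queue2.metrics.samples)  # → [('workqueue_depth', 1)]
```

If something like this is what happened, the metrics endpoint would still serve fine and every other family would appear, with only the `workqueue_*` families missing, which matches the symptoms reported above.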
Related to https://github.com/argoproj/argo-cd/issues/12241 and https://github.com/argoproj/argo-cd/pull/8318