High cardinality of metrics
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.7.0
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- enable metrics on the controller by setting:
```yaml
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"
```
- you need a build environment with enough traffic that it runs many jobs across different repos and PRs
Describe the bug
The listeners expose metrics, which is great. However, the cardinality of these metrics simply does not scale. There needs to be a way to disable high-cardinality labels on the metrics.
In just one hour we got over 15k new time series in our Prometheus, which is going to explode if we keep these metrics enabled for even 12 hours.
Describe the expected behavior
The expected behaviour is that high-cardinality labels are removed from the metrics. Histogram buckets should also be configurable.
Let's take the worst offender, gha_job_execution_duration_seconds_bucket:
it has a job_workflow_ref label that is almost always unique, which means it constantly creates new time series in Prometheus, and that is really expensive. It is also worth asking whether job_name is needed at all. I would like to be able to disable both of these labels. The number of default buckets is also going to explode our Prometheus.
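To illustrate the growth rate, here is a back-of-the-envelope estimate. It assumes the Go client's 11 default histogram buckets (the actual ARC bucket layout may differ), plus the +Inf bucket and the _sum/_count series:

```python
# Rough estimate of time series growth for one histogram metric whose
# label set is unique per job (e.g. via job_workflow_ref).
# Assumption: 11 default buckets as in prometheus.DefBuckets (Go client);
# ARC may use a different bucket layout.

DEFAULT_BUCKETS = 11
series_per_label_set = DEFAULT_BUCKETS + 1 + 2  # +Inf bucket, _sum, _count

def new_series_per_hour(unique_jobs_per_hour: int) -> int:
    """Series created per hour when every job yields a fresh label set."""
    return unique_jobs_per_hour * series_per_label_set

print(new_series_per_hour(1000))  # prints 14000
```

At a thousand unique jobs per hour, a single histogram metric alone accounts for roughly 14k new series per hour, which is consistent with the ~15k/hour we observed.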
Additional Context
```yaml
replicaCount: 2
flags:
  logLevel: "debug"
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"
```
Controller Logs
not relevant
Runner Pod Logs
not relevant
I have now dropped the high-cardinality metrics with relabelings. However, I would still like to see job_name and job_workflow_ref removed from all of these metrics, or at least the possibility to configure that. The current metrics might work in an environment with around 10 builds per day, but we run more than a thousand per hour.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: listeners
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: gha-runner-scale-set
  podMetricsEndpoints:
    - port: metrics
      metricRelabelings:
        # Metric names are only available after the scrape, so drop rules
        # must match __name__ under metricRelabelings (not relabelings).
        # Prometheus regexes are fully anchored, so the trailing .* is
        # needed to catch the _bucket/_sum/_count histogram series.
        - action: drop
          sourceLabels: [__name__]
          regex: 'gha_job_(execution|startup)_duration_seconds.*'
        - action: drop
          sourceLabels: [__name__]
          regex: 'gha_completed_jobs_total|gha_started_jobs_total'
```
The labels on both gha_job_execution_duration_seconds and gha_job_startup_duration_seconds mean that a new bucket series is created for every run of every job, so each bucket will only ever contain a 0 or a 1. You cannot get meaningful information out of these metrics.
Prometheus cannot aggregate series before applying rate() to them, so with the current layout of these metrics it is impossible to produce a histogram of startup or execution durations.
Observations should be bucketed by job_name, organisation, and repository only. Highly unique labels such as runner_id, runner_name, and job_workflow_ref should be removed.
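Until the labels can be disabled upstream, a possible middle ground (instead of dropping the metrics entirely, as above) is to strip only the high-cardinality labels at scrape time. A sketch, assuming the Prometheus Operator's `metricRelabelings` field; be aware that `labeldrop` can leave two series with identical label sets in one scrape, in which case Prometheus rejects the duplicate samples:

```yaml
podMetricsEndpoints:
  - port: metrics
    metricRelabelings:
      # Strip the (nearly) unique labels instead of dropping the metrics.
      # Caution: if series become identical after the drop, the scrape
      # contains duplicate samples and Prometheus will reject them.
      - action: labeldrop
        regex: 'job_workflow_ref|runner_id|runner_name'
```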
Also, after many builds you are likely to get scrape errors:
```
"http://10.2.0.23:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216
```
@nikola-jokic is there any chance this could be picked up for development?
For my use case, these are definitely important metrics to expose. I'd love to use them to monitor workflows properly, but the current setup makes that close to impossible.
If we remove the highly unique labels and add a label for the name, that would probably solve everything.