High cardinality of metrics
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.7.0
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- enable metrics on the controller by setting:
```yaml
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"
```
- you need a build environment with enough traffic that it runs many jobs across different repos and PRs
Describe the bug
The listeners expose metrics, which is great. However, the cardinality of these metrics simply does not scale. There needs to be a way to disable high-cardinality labels on the metrics.
In just one hour we got over 15k new time series in our Prometheus, which is going to explode if we keep these metrics enabled for even 12 hours.
Describe the expected behavior
The expected behaviour is that high-cardinality labels are removed from the metrics. Histogram buckets should also be configurable.
Let's take the worst offender, gha_job_execution_duration_seconds_bucket:
it has a job_workflow_ref label that is almost always unique, which means it constantly creates new time series in Prometheus, and that is really expensive. It is also worth asking whether job_name is needed at all. I would like to be able to disable both of these labels. The number of default buckets is also going to explode our Prometheus.
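To illustrate the growth rate, here is a back-of-the-envelope estimate. It assumes the Go client's 11 default histogram buckets (the actual ARC bucket layout may differ), plus the +Inf bucket and the _sum/_count series:

```python
# Rough estimate of time series growth for one histogram metric whose
# label set is unique per job (e.g. via job_workflow_ref).
# Assumption: 11 default buckets as in prometheus.DefBuckets (Go client);
# ARC may use a different bucket layout.

DEFAULT_BUCKETS = 11
series_per_label_set = DEFAULT_BUCKETS + 1 + 2  # +Inf bucket, _sum, _count

def new_series_per_hour(unique_jobs_per_hour: int) -> int:
    """Series created per hour when every job yields a fresh label set."""
    return unique_jobs_per_hour * series_per_label_set

print(new_series_per_hour(1000))  # prints 14000
```

At a thousand unique jobs per hour, a single histogram metric alone accounts for roughly 14k new series per hour, which is consistent with the ~15k/hour we observed.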
Additional Context
```yaml
replicaCount: 2
flags:
  logLevel: "debug"
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"
```
Controller Logs
not relevant
Runner Pod Logs
not relevant
I have now dropped the high-cardinality metrics with relabelings. However, I would still like to see job_name and job_workflow_ref removed from all of these metrics, or at least the possibility to configure that. The current metrics might work in an environment with around 10 builds per day, but we run more than a thousand per hour.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: listeners
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: gha-runner-scale-set
  podMetricsEndpoints:
    - port: metrics
      metricRelabelings:
        # Metric names are only available after the scrape, so drop rules
        # must match __name__ under metricRelabelings (not relabelings).
        # Prometheus regexes are fully anchored, so the trailing .* is
        # needed to catch the _bucket/_sum/_count histogram series.
        - action: drop
          sourceLabels: [__name__]
          regex: 'gha_job_(execution|startup)_duration_seconds.*'
        - action: drop
          sourceLabels: [__name__]
          regex: 'gha_completed_jobs_total|gha_started_jobs_total'
```
The labels on both gha_job_execution_duration_seconds and gha_job_startup_duration_seconds mean that a new bucket series is created for every run of every job, so each bucket will only ever contain a 0 or a 1. You cannot get meaningful information out of these metrics.
Prometheus cannot aggregate series before applying rate() to them, so with the current layout of these metrics it is impossible to produce a histogram of startup or execution durations.
Observations should be bucketed by job_name, organisation, and repository only. Highly unique labels such as runner_id, runner_name, and job_workflow_ref should be removed.
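Until the labels can be disabled upstream, a possible middle ground (instead of dropping the metrics entirely, as above) is to strip only the high-cardinality labels at scrape time. A sketch, assuming the Prometheus Operator's `metricRelabelings` field; be aware that `labeldrop` can leave two series with identical label sets in one scrape, in which case Prometheus rejects the duplicate samples:

```yaml
podMetricsEndpoints:
  - port: metrics
    metricRelabelings:
      # Strip the (nearly) unique labels instead of dropping the metrics.
      # Caution: if series become identical after the drop, the scrape
      # contains duplicate samples and Prometheus will reject them.
      - action: labeldrop
        regex: 'job_workflow_ref|runner_id|runner_name'
```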
Also, after many builds you are likely to get scrape errors:
```
"http://10.2.0.23:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216
```
@nikola-jokic is there any chance this could be picked up for development?
For my use case, these are definitely important metrics to expose. I'd love to use them to monitor workflows properly, but the current setup makes that close to impossible.
If we remove the highly unique labels and add a label for the name, that would probably solve everything.