actions-runner-controller

Invalid values for the metrics gha_registered_runners and gha_idle_runners in ghalistener

Open verdel opened this issue 1 year ago • 8 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.9.2

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

-

Describe the bug

We have a GitHub Action that runs once a day, with a dedicated runner type allocated specifically for it. While the workflow is running, the listener receives a final batch of messages about job execution, and in that last message statistics.totalIdleRunners and statistics.totalRegisteredRunners contain non-zero values.

These values are published by the controller as Prometheus metrics. After this last message, the metric values do not change until the runners are used again the following day.

Is it possible to fix this behavior, or does it require changes on the GitHub side?
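To make the staleness concrete, here is a minimal Python model of the behavior described above. This is illustrative only, not ARC code: the `Gauge` class and `handle_message` function are stand-ins, but the mechanism (gauges refreshed only on message receipt) matches what the logs show.

```python
# Minimal sketch of why the gauges go stale: the listener only updates
# its statistics gauges when a message batch arrives, so the last
# reported values persist once the scale set goes idle.

class Gauge:
    """Stand-in for a Prometheus gauge: holds the last value set."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

registered = Gauge()
idle = Gauge()

def handle_message(stats):
    # Gauges are refreshed only here, on message receipt.
    registered.set(stats["totalRegisteredRunners"])
    idle.set(stats["totalIdleRunners"])

# Last batch before the scale set goes idle (see the logs below):
handle_message({"totalRegisteredRunners": 2, "totalIdleRunners": 1})

# No further messages arrive while idle, so every scrape keeps
# returning these final values until the next job run.
print(registered.value, idle.value)  # 2 1
```

Until another message arrives (the next day's run, in our case), the scraped values are frozen at whatever the last batch contained.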

Describe the expected behavior

The Prometheus metrics published by ghalistener should reflect the actual state of the runners.

Additional Context

-

Controller Logs

2024-05-26T07:24:29Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 1089}
2024-05-26T07:24:37Z	INFO	listener-app.listener	Processing message	{"messageId": 1090, "messageType": "RunnerScaleSetJobMessages"}
2024-05-26T07:24:37Z	INFO	listener-app.listener	New runner scale set statistics.	{"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":3,"totalAssignedJobs":3,"totalRunningJobs":3,"totalRegisteredRunners":4,"totalBusyRunners":3,"totalIdleRunners":0}}
2024-05-26T07:24:37Z	INFO	listener-app.listener	Job completed message received.	{"RequestId": 669571, "Result": "succeeded", "RunnerId": 83622, "RunnerName": "terraform-drift-checker-hxppm-runner-9q2hx"}
2024-05-26T07:24:37Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 1090}
2024-05-26T07:24:38Z	INFO	listener-app.worker.kubernetesworker	Calculated target runner count	{"assigned job": 3, "decision": 3, "min": 0, "max": 30, "currentRunnerCount": 3, "jobsCompleted": 1}
2024-05-26T07:24:38Z	INFO	listener-app.worker.kubernetesworker	Compare	{"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":3,\"patchID\":6917,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-05-26T07:24:38Z	INFO	listener-app.worker.kubernetesworker	Preparing EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":6917,\"replicas\":3}}"}
2024-05-26T07:24:38Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "github-actions-runner", "name": "terraform-drift-checker-hxppm", "replicas": 3}
2024-05-26T07:24:38Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 1090}
2024-05-26T07:24:50Z	INFO	listener-app.listener	Processing message	{"messageId": 1091, "messageType": "RunnerScaleSetJobMessages"}
2024-05-26T07:24:50Z	INFO	listener-app.listener	New runner scale set statistics.	{"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":0,"totalAssignedJobs":0,"totalRunningJobs":0,"totalRegisteredRunners":2,"totalBusyRunners":0,"totalIdleRunners":1}}
2024-05-26T07:24:50Z	INFO	listener-app.listener	Job completed message received.	{"RequestId": 669572, "Result": "succeeded", "RunnerId": 83625, "RunnerName": "terraform-drift-checker-hxppm-runner-6dmvl"}
2024-05-26T07:24:50Z	INFO	listener-app.listener	Job completed message received.	{"RequestId": 669573, "Result": "succeeded", "RunnerId": 83623, "RunnerName": "terraform-drift-checker-hxppm-runner-lcc6k"}
2024-05-26T07:24:50Z	INFO	listener-app.listener	Job completed message received.	{"RequestId": 669574, "Result": "succeeded", "RunnerId": 83624, "RunnerName": "terraform-drift-checker-hxppm-runner-d2bv7"}
2024-05-26T07:24:50Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 1091}

Runner Pod Logs

-

verdel avatar May 26 '24 13:05 verdel

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar May 26 '24 13:05 github-actions[bot]

Hey @verdel,

You are right, we receive an empty batch if no activity is needed, so the metric would be incorrect when the cluster becomes idle. Ideally, to reflect the correct metric, the changes should be made on the API side. However, we can optimistically set this metric to the desired count when the cluster becomes idle. Let me discuss it with the team, and I'll get back to you with more information :relaxed:

nikola-jokic avatar May 30 '24 10:05 nikola-jokic
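The optimistic reset described above could look roughly like the following. This is a hypothetical sketch, not ARC's actual implementation; `on_cluster_idle` and the gauge names are placeholders for illustration.

```python
# Hypothetical mitigation: when the listener knows the scale set has
# gone idle, overwrite the stale statistics optimistically instead of
# waiting for a message that will never come while idle.

class Gauge:
    """Stand-in for a Prometheus gauge."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

busy_runners = Gauge()
idle_runners = Gauge()

def on_cluster_idle(current_runner_count):
    # With no jobs running, busy must be 0 and every remaining runner
    # is idle, regardless of what the last statistics message said.
    busy_runners.set(0)
    idle_runners.set(current_runner_count)

# Stale values left over from the last message batch:
busy_runners.set(3)
idle_runners.set(0)

# Listener detects the set is idle with 1 runner kept warm:
on_cluster_idle(current_runner_count=1)
print(busy_runners.value, idle_runners.value)  # 0 1
```

The trade-off is that the values are inferred locally rather than reported by the API, which is why a proper fix on the API side would still be preferable.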

Hello team, we would also be interested in this fix. Do you have an update on this by any chance?

nicolas-laduguie avatar Aug 12 '24 09:08 nicolas-laduguie

Not sure if this is exactly related, but we have an issue where the gha_idle_runners metric stays at zero even when the UI shows 20+ idle runners. It would be nice to have this metric working so we can set up some alerts.

glaberge avatar Oct 08 '24 19:10 glaberge
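For reference, once the metric is trustworthy, an alert along these lines could catch runners sitting idle. This is an illustrative Prometheus rule, not something from the ARC docs; the threshold, duration, and label names are placeholders to adapt to your setup.

```yaml
# Hypothetical alert rule: fire when a scale set reports many idle
# runners for an extended period. Only meaningful once gha_idle_runners
# reflects the actual runner state.
groups:
  - name: arc-runners
    rules:
      - alert: TooManyIdleRunners
        expr: gha_idle_runners > 10
        for: 30m
        annotations:
          summary: "Runner scale set has had more than 10 idle runners for 30m"
```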

I'm observing the same thing. The value of gha_idle_runners seems to always be 0 even with many idle runners.

mikespharss avatar Oct 09 '24 22:10 mikespharss

We're facing issues monitoring our runners because of this, and it has been open for a while.

arcezd avatar Nov 22 '24 23:11 arcezd

On 0.10.1, gha_idle_runners is stale, stuck at some historical number until the listener restarts. I was worried there was an unexpected number of idle runners, but that was not the case.

After restart:

[screenshot: gha_idle_runners after listener restart]

Maybe it's not wrong; I might have needed to wait a bit for the listener to pick up the state of the idle runners. Either way, this now looks like a correct value, whereas previously we were seeing 0 all the time. But again, the values go stale. Mixed bag.

alen-z avatar Mar 07 '25 17:03 alen-z

Any updates on this?

seahorseing-around avatar Nov 11 '25 22:11 seahorseing-around