
More runner and workflow metrics

Open ColinHeathman opened this issue 3 years ago • 15 comments

Is your feature request related to a problem? Please describe.

As an ARC operator, I want to be able to monitor and alert on the status of the runners I manage. It would be convenient if these metrics were bundled as part of the ARC deployment instead of requiring another third-party adapter.

A few things that would be interesting & actionable:

runner status & availability

  • runner startup times
  • runner status (idle, available, offline)

workflow status

  • queued workflows
  • in-progress workflows
  • run times
  • pass/fail stats

Describe the solution you'd like

ARC is well-positioned to aggregate and expose the data for these metrics.

Keeping up-to-date state for running Workflows would probably require a webhook, since the status can change faster than polling can keep up.

eg.

# HELP github_workflow_runtime Duration of all workflow runs
# TYPE github_workflow_runtime histogram
github_workflow_runtime_bucket{conclusion="success", workflow_id="12345678",le="0.1"} 5
github_workflow_runtime_bucket{conclusion="success", workflow_id="12345678",le="1"} 5
github_workflow_runtime_bucket{conclusion="success", workflow_id="12345678",le="10"} 5
...
github_workflow_runtime_bucket{conclusion="success", workflow_id="12345678",le="+Inf"} 5
github_workflow_runtime_sum{conclusion="success", workflow_id="12345678"} 1000
github_workflow_runtime_count{conclusion="success", workflow_id="12345678"} 5

# HELP github_workflow_conclusion Conclusions of all GitHub workflows
# TYPE github_workflow_conclusion counter
github_workflow_conclusion{conclusion="success", workflow_id="12345678"} 5

# HELP github_workflow_status Live status information
# TYPE github_workflow_status gauge
github_workflow_status{status="in_progress", workflow_id="12345678"} 5
github_workflow_status{status="queued", workflow_id="12345678"} 5

# HELP github_runner_status Live runner status information
# TYPE github_runner_status gauge
github_runner_status{status="idle", runner_id="4321"} 5
github_runner_status{status="active", runner_id="4321"} 5
github_runner_status{status="offline", runner_id="4321"} 5

# HELP github_runner_startup Duration of all runner startups
# TYPE github_runner_startup histogram
github_runner_startup_bucket{runner_id="4321",le="0.1"} 5
github_runner_startup_bucket{runner_id="4321",le="1"} 5
github_runner_startup_bucket{runner_id="4321",le="10"} 5
...
github_runner_startup_bucket{runner_id="4321",le="+Inf"} 5
github_runner_startup_sum{runner_id="4321"} 100
github_runner_startup_count{runner_id="4321"} 5
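
For readers unfamiliar with the histogram lines above: in the Prometheus text format, each `le` bucket is cumulative, so it counts every observation less than or equal to its bound. The following is a minimal, stdlib-only sketch of that bookkeeping. The `Histogram` class and its methods are hypothetical names made up for illustration; they are not part of ARC or of any Prometheus client library.

```python
from bisect import bisect_left


class Histogram:
    """Hypothetical minimal histogram that renders Prometheus-style text lines."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)                 # finite upper bounds ("le" values)
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self, labels):
        pairs = [f'{k}="{v}"' for k, v in sorted(labels.items())]
        lines, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count  # buckets are cumulative
            le = "+Inf" if bound == float("inf") else f"{bound:g}"
            bucket_labels = ",".join(pairs + [f'le="{le}"'])
            lines.append(f"{self.name}_bucket{{{bucket_labels}}} {running}")
        plain_labels = ",".join(pairs)
        lines.append(f"{self.name}_sum{{{plain_labels}}} {self.total}")
        lines.append(f"{self.name}_count{{{plain_labels}}} {running}")
        return lines
```

A real implementation would more likely rely on an existing Prometheus client library; the sketch only shows why the `+Inf` bucket always equals the `_count` value.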

Describe alternatives you've considered

https://github.com/Spendesk/github-actions-exporter is an available solution for getting some of these metrics, but it has a few problems:

  • It is not able to deal with ephemeral runners, and leads to invalid metrics
  • The polling model for workflows data can miss states that don't last long (eg. queuing, short-lived workflows)
  • The workflow metrics create an ever-growing list of metric series

ColinHeathman avatar Jun 18 '22 00:06 ColinHeathman

Is this within the scope of ARC? If so I am willing to work on the feature

ColinHeathman avatar Jun 18 '22 00:06 ColinHeathman

@ColinHeathman Hey! Thanks for bringing this up.

Is this within the scope of ARC? If so I am willing to work on the feature

Yes, if and only if ARC is going to be the only viable solution to the problem.

ARC doesn't currently track workflow runs so I wonder how ARC can provide github_workflow_runtime_bucket better than github-actions-exporter.

github_workflow_status seems relatively fine, because it looks almost like a snapshot of the workflow run statuses that ARC probably already knows about when the pull-based autoscaling metric is TotalNumberOfInProgressAndQueuedWorkflowRuns. But the same information can be observed by github-actions-exporter too. The same can be said for github_runner_status.

github_runner_startup_bucket is interesting. How are you going to measure the startup duration though? 🤔 AFAIK, we have no reliable way to detect when the runner has finished starting up. Does actions/runner have a hook that can be used to notify ARC when it has finished starting, or shall we follow the logs from the runner container so that we can pattern-match against a message actions/runner writes once it has finished starting?

mumoshu avatar Jun 18 '22 08:06 mumoshu

I am also interested in what alerting people use to make sure their self hosted github actions cluster is healthy.

cep21 avatar Jun 21 '22 21:06 cep21

I think github_workflow_status would be unreliable for any pull-based autoscaling metric. I think that it would be necessary to put the logic into the github-webhook-server and make it a prerequisite for getting workflow metrics. github-actions-exporter would have to be rewritten to support webhooks, which is why I think it might make sense to be a part of ARC.

github_runner_startup_bucket is more difficult. It might belong in a separate discussion. I would think that the best way to measure it would be to track the time between pod creation and the moment GitHub first sees the runner as either "Active" or "Idle".
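
The measurement proposed here can be sketched roughly as follows. This is only an illustration under stated assumptions: `startup_seconds`, `READY_STATES`, and the `(timestamp, status)` observation pairs are hypothetical, and how ARC would actually obtain the pod creation time or the runner status is not specified in this thread.

```python
from datetime import datetime, timedelta, timezone

# Assumed status names, taken from the comment above; real values may differ.
READY_STATES = {"active", "idle"}


def startup_seconds(pod_created_at, observations):
    """Seconds from pod creation to the first observation in a ready state.

    observations: iterable of (timestamp, status) pairs, for example collected
    by polling or from webhook events. Returns None if the runner was never
    observed as ready.
    """
    ready_times = [
        ts
        for ts, status in observations
        if status.lower() in READY_STATES and ts >= pod_created_at
    ]
    if not ready_times:
        return None
    return (min(ready_times) - pod_created_at).total_seconds()
```

If the observations come from polling, the result overestimates the true startup time by up to one polling interval, which is part of why this metric is harder than the others.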

ColinHeathman avatar Jun 22 '22 01:06 ColinHeathman

One case I am interested in is comparing the number of active runners with the number of desired replicas, so that when I reach the maximum number of replicas I can tell whether all the runners are in use or some are still idling.

In my setup, the HorizontalRunnerAutoscaler has a minimum of 2 replicas, and the scaleUpTrigger is the githubEvent workflowJob.

Moser-ss avatar Jun 22 '22 09:06 Moser-ss

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 23 '22 02:07 github-actions[bot]

Not stale

cep21 avatar Jul 23 '22 02:07 cep21

@cep21 What are your criteria for alerting folks about unhealthy runners? This feature request still seems to lack the exact use case it should support.

mumoshu avatar Jul 24 '22 22:07 mumoshu

My use case:

I work in infrastructure. My team is responsible for the health and availability of the self hosted runners. I would like to proactively know if something is broken, rather than wait for teams to tell me the system is broken. DevOps and infrastructure teams often use metrics for this kind of workflow, where they create alerts on these metrics. Here are a few metrics that will help me:

  • How long does it take for a job to find a self hosted runner to run on
    • This is the most important one for me. It represents slowness in the self hosted runner setup.
  • Current (and maximum) number of auto scaled runners: so I can alert when that number equals the maximum size
  • Can self hosted runners log out the action job that they are running currently?
  • An alert or log or metric on how long it takes the controller to give me a pod for a runner
  • A metric on the number of successful (and failed) workflow runs
  • The time delta between the system asking for a self hosted runner, and a self hosted runner being up and available.

For infrastructure teams, it's often not enough to just deploy software: a critical part of the workflow is to set up alerting and metrics so we can know when things are broken or misbehaving.

cep21 avatar Jul 24 '22 23:07 cep21

@cep21 I hear you, but it looks like all of those metrics can be implemented outside of ARC by leveraging GitHub Actions API and workflow_job webhook events emitted by GitHub 🤔 Have you already surveyed other solutions like https://github.com/Spendesk/github-actions-exporter? I can also recommend reading https://www.cbui.dev/github-actions-self-hosted-runner-observability-and-monitoring/

mumoshu avatar Jul 24 '22 23:07 mumoshu

Is anyone working on this? If not, we should close this as stale and encourage other folks to start a new project as a place for collaboration around metrics and monitoring. ARC should provide metrics about its own resources, not metrics across all the Actions-related resources owned by GitHub, as they are two related but different beasts.

mumoshu avatar Jul 25 '22 00:07 mumoshu

I can give an example of one possible metric and a use case: Metric: The state of the runners created by ARC

  • Number of active, idle, and offline runners by label

And I think this is something that ARC is already checking when it uses the PercentageRunnersBusy metric in Pull Driven Scaling, so it could make sense to have ARC expose it, because ARC is creating the runners.

The use cases:

  1. Tracking usage of the runners: if I have a HorizontalRunnerAutoscaler whose max is 10 and the desired value is already 10, I might think that my users don't have enough runners and that I need to create more. But in reality, of those 10 runners, only 6 or 7 are in the active state
  2. Tracking the offline runners that are leftovers from ARC
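
The per-label counts and the first use case could be sketched as below. The input dictionaries are assumed to follow the shape of self-hosted runner objects returned by the GitHub REST API (`status`, `busy`, and `labels` entries with a `name`); `classify`, `runner_states_by_label`, and `is_saturated` are hypothetical helper names, not ARC functions.

```python
from collections import Counter


def classify(runner):
    """Map a runner object to one of: active, idle, offline."""
    if runner["status"] != "online":
        return "offline"
    return "active" if runner["busy"] else "idle"


def runner_states_by_label(runners):
    """Count runners per (label name, state) pair."""
    counts = Counter()
    for runner in runners:
        state = classify(runner)
        for label in runner["labels"]:
            counts[(label["name"], state)] += 1
    return counts


def is_saturated(counts, label, max_replicas):
    """True only when every allowed replica for the label is busy."""
    return counts[(label, "active")] >= max_replicas
```

With 10 desired replicas of which only 7 are busy, `is_saturated` returns `False`, which is the distinction described in use case 1: reaching the maximum replica count is not the same as running out of capacity.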

If you approve of this feature, I don't mind working on that. As a workaround, I built a service that tracked a lot of data related to GitHub Actions and Runners.

Moser-ss avatar Jul 25 '22 18:07 Moser-ss

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 25 '22 02:08 github-actions[bot]

I'm planning on making a PR to enable a new prometheus server on the githubwebhookserver runtime, and expose these metrics:

github_workflow_runtime (histogram)
github_workflow_queue (histogram)
github_workflow_conclusion (counter)
github_workflow_status (gauge)

Currently we are most interested in monitoring queue times. We have an elastic cluster, and custom startup logic that takes time so queue time is something we have had problems with in the past, and we need a solution to monitor this.

ColinHeathman avatar Aug 26 '22 21:08 ColinHeathman

@ColinHeathman Hey! Thanks for your help. Nit but I think the prefix should be github_workflow_job_ rather than just github_workflow_ as actually "workflow jobs" get scheduled onto and run by self-hosted runners, rather than "workflows".

The ideas of four metrics sound good!

Just my two cents: implementing the histogram metrics can be a bit tricky, as you might need to track all the statuses of ongoing workflow jobs that are deduced from received workflow job events.
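
To illustrate the state tracking mentioned here, the following is a rough sketch of deriving queue time, run time, conclusions, and live status from a stream of workflow_job events. It assumes each event carries one of the actions `queued`, `in_progress`, or `completed` plus a job ID, and that the receiver records its own receipt time in seconds. `WorkflowJobTracker` is a hypothetical name, not the planned PR's implementation.

```python
class WorkflowJobTracker:
    """Hypothetical in-memory tracker for workflow_job webhook events."""

    def __init__(self):
        self.queued_at = {}        # job_id -> time the "queued" event arrived
        self.started_at = {}       # job_id -> time the "in_progress" event arrived
        self.queue_durations = []  # observations for a queue-time histogram
        self.run_durations = []    # observations for a runtime histogram
        self.conclusions = {}      # conclusion -> count (counter)

    def handle(self, action, job_id, received_at, conclusion=None):
        if action == "queued":
            self.queued_at[job_id] = received_at
        elif action == "in_progress":
            queued = self.queued_at.pop(job_id, None)
            if queued is not None:  # skip jobs whose "queued" event was missed
                self.queue_durations.append(received_at - queued)
            self.started_at[job_id] = received_at
        elif action == "completed":
            self.queued_at.pop(job_id, None)  # e.g. cancelled before starting
            started = self.started_at.pop(job_id, None)
            if started is not None:
                self.run_durations.append(received_at - started)
            self.conclusions[conclusion] = self.conclusions.get(conclusion, 0) + 1

    def status(self):
        """Snapshot for a gauge: jobs currently queued and in progress."""
        return {"queued": len(self.queued_at), "in_progress": len(self.started_at)}
```

Because the state lives in memory, a restart of the webhook server would lose in-flight jobs, and events that arrive out of order or are never delivered would leave stale entries. Those are the tricky parts a real implementation would need to address.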

mumoshu avatar Aug 29 '22 01:08 mumoshu

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Sep 28 '22 02:09 github-actions[bot]

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Oct 29 '22 02:10 github-actions[bot]

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Dec 15 '22 02:12 github-actions[bot]