
add metrics for providers

Open · pathcl opened this issue 1 year ago · 1 comment

We'd like to understand more about runners and providers.

We have metrics for the GH API calls, but none for provider calls. We currently can't tell whether a runner failed to reach the idle state and is being recreated over and over because of the bootstrap timeout.

Let's try to add metrics for provider calls.

pathcl, Jun 29 '24 08:06

Hi @pathcl, with https://github.com/cloudbase/garm/pull/217 I've also introduced metrics for the runner package (documentation: https://github.com/cloudbase/garm/blob/main/doc/config_metrics.md#runner-metrics).

We are already running a patched version of v0.1.4 in which we cherry-picked the changes we wanted on our side (#217 is among them). Feel free to build our patched garm version yourself and give it a try; all of the patches are already part of the main branch of garm itself.

Out of curiosity: do you want more metrics than these, or is this exactly what you are looking for?


PromQL query:

    (
        sum by (operation, provider) (
          rate(
            garm_runner_errors_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}[5m]
          )
        )
      or
        sum by (operation, provider) (
            garm_runner_operations_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}
          *
            0
        )
    )
  /
    sum by (operation, provider) (
      rate(
        garm_runner_operations_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}[5m]
      )
    )
*
  100
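
For readers less familiar with PromQL, here is a rough Python sketch of what the query above computes: the error percentage per `(operation, provider)` pair, where the `or ... * 0` branch supplies a zero for pairs that have operations but no error series. The function name `error_percentage` and the label values in the usage note are hypothetical and only serve the illustration; this is not code from garm.

```python
import math


def error_percentage(ops_rate, err_rate):
    """Mimic the PromQL expression (errors or ops * 0) / ops * 100.

    ops_rate: dict mapping (operation, provider) -> operations per second
    err_rate: dict mapping (operation, provider) -> errors per second
    Both dicts are hypothetical stand-ins for the two rate() results.
    """
    result = {}
    for key, ops in ops_rate.items():
        # The "or ... * 0" branch: fall back to 0 when no error series exists.
        errors = err_rate.get(key, 0.0)
        if ops == 0:
            # PromQL division by zero yields NaN (for 0/0); mirror that here.
            result[key] = math.nan
        else:
            result[key] = errors / ops * 100
    return result
```

With made-up inputs such as `{("CreateInstance", "prov-a"): 2.0, ("DeleteInstance", "prov-a"): 1.0}` for operations and `{("CreateInstance", "prov-a"): 0.5}` for errors, the first pair comes out at 25.0 percent and the second at 0.0 percent instead of disappearing from the result, which is the point of the zero fallback.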

bavarianbidi, Jul 22 '24 10:07