garm webhook && metrics/o11y

Open pathcl opened this issue 1 year ago • 3 comments

Hello folks,

One of the challenges with runners and GitHub Actions, even after all these years, is still observability.

I'd like to know if we have plans to work on o11y for garm's webhook. https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L98

Use case(s)

  • Detect a workflow that is stuck because of a failed runner/provider. I know we have a timeout for bootstrap.
  • What are the P99/P90 for jobs and runners, e.g. startup time?
  • Get better insight into jobs. It should be possible to log/report on webhook events.
  • GitHub Actions doesn't provide a retry mechanism. How do we cope with that?

pathcl • Jun 26 '24 20:06