garm
garm webhook && metrics/o11y
Hello folks,
One of the challenges with runners and GitHub Actions, even after years, is still observability.
I'd like to know if we have plans to work on o11y for garm's webhook. https://github.com/cloudbase/garm/blob/8f0d44742e3fcae1746b75899c132881b7b4ada1/apiserver/controllers/controllers.go#L98
Use case(s)
- Detecting a workflow that is stuck because of a failed runner/provider. I know we have a timeout for bootstrap.
- What are the P99/P90 for jobs and runners, e.g. startup time?
- Get better insight into jobs. It should be possible to log/report on webhook events.
- GitHub Actions doesn't provide a retry mechanism. How do we cope with that?