Lacking Prow cluster metrics

Open • howardjohn opened this issue 6 years ago • 1 comment

It would be useful to have metrics like

  • CPU/Memory usage, relative to requests/limits, so we can see if we need to tune these
  • How much capacity we have left, so we can see if we need to scale our cluster up/down. This is the main thing I want right now

A nice-to-have would be seeing these per job.

Looking at Stackdriver, it seems we can get the GCE stats of the underlying nodes, but I don't see the Kubernetes metrics there.

It's possible I am just looking in the wrong place and we already have these.

howardjohn, Aug 08 '19 22:08

What we have:

  • Monitoring and alerting of Prow components: https://monitoring.prow.istio.io. These report to the Istio #test-alerts channel and, once https://github.com/istio/test-infra/pull/2610 is merged, will also report critical errors to the #oncall channel.

  • CPU/Memory usage, relative to requests/limits (and other useful node and job metrics), provided by the Stackdriver Prow dashboard. I am experimenting with various alerts here; once the alerts are properly tuned, I will push these to Slack as well.


What we do not have:

  • How much capacity we have left, so we can see if we need to scale our cluster up/down. I recall this data being unavailable, and I think getting it requires the GCP monitoring agent.
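
To make "capacity left" concrete, here is a minimal sketch. It assumes capacity left means each node's allocatable CPU/memory minus the sum of the requests of the pods scheduled on it; that definition is an assumption, not something settled in this thread. The `Node` and `Pod` types and the sample numbers are hypothetical, and in practice the inputs would come from the Kubernetes API or a metrics backend.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int    # allocatable CPU in millicores
    allocatable_mem_mib: int  # allocatable memory in MiB


@dataclass
class Pod:
    node: str             # name of the node the pod is scheduled on
    cpu_request_m: int    # summed container CPU requests, millicores
    mem_request_mib: int  # summed container memory requests, MiB


def headroom(nodes, pods):
    """Return {node name: (free_cpu_m, free_mem_mib)}.

    "Free" is allocatable minus summed requests (assumed definition).
    Pods on nodes that are not in `nodes` are ignored.
    """
    free = {n.name: [n.allocatable_cpu_m, n.allocatable_mem_mib] for n in nodes}
    for p in pods:
        if p.node in free:
            free[p.node][0] -= p.cpu_request_m
            free[p.node][1] -= p.mem_request_mib
    return {name: (cpu, mem) for name, (cpu, mem) in free.items()}


def cluster_cpu_free_fraction(nodes, pods):
    """Fraction of cluster-wide allocatable CPU not yet requested."""
    total = sum(n.allocatable_cpu_m for n in nodes)
    free = sum(cpu for cpu, _ in headroom(nodes, pods).values())
    return free / total if total else 0.0


# Hypothetical sample data, for illustration only.
nodes = [Node("node-a", 4000, 16384), Node("node-b", 4000, 16384)]
pods = [
    Pod("node-a", 1000, 2048),
    Pod("node-a", 500, 1024),
    Pod("node-b", 3000, 8192),
]
```

A cluster-wide fraction like `cluster_cpu_free_fraction(nodes, pods)` is the sort of number that could drive a scale up/down decision, while the per-node values from `headroom` show whether the free capacity is fragmented across nodes.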

clarketm, Apr 23 '20 00:04