Lacking Prow cluster metrics
It would be useful to have metrics like:
- CPU/Memory usage, relative to requests/limits, so we can see if we need to tune these
- How much capacity we have left, so we can see if we need to scale our cluster up/down. This is the main thing I want right now.
A nice-to-have would be seeing these metrics broken down per job.
Looking at Stackdriver, it seems we can get the GCE stats of the underlying nodes, but I don't see the Kubernetes metrics there.
It's possible I am just looking in the wrong place and we already have these.
What we have:
- Monitoring and alerting of Prow components: https://monitoring.prow.istio.io. These report to the Istio #test-alerts channel (and, once https://github.com/istio/test-infra/pull/2610 goes through, will report critical errors to the #oncall channel).
- CPU/Memory usage relative to requests/limits (and other useful node and job metrics), provided by the Stackdriver Prow dashboard. I am experimenting with various alerts here; once the alerts are properly tuned, I will push these to Slack as well.
What we do not have:
- How much capacity we have left, so we can see if we need to scale our cluster up/down. I think this requires the GCP monitoring agent, because I recall this data being unavailable without it.
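
For reference, here is a rough sketch of what a "capacity left" number could look like if we computed it ourselves: total node allocatable minus the sum of pod resource requests. This is only an illustration, not something we run today. The function names and the input shape are hypothetical; in practice the inputs would have to be pulled from each node's `status.allocatable` and each pod's `spec.containers[].resources.requests`. The quantity parsing handles only the common suffixes, not the full Kubernetes quantity grammar.

```python
# Hypothetical helper: estimate remaining schedulable capacity in a cluster.
# Assumes inputs are already extracted from the Kubernetes API as plain dicts.

# Subset of memory suffixes used by Kubernetes quantities (an assumption:
# exotic forms such as exponents are not handled here).
_MEM_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '1Gi' to bytes."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_MEM_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = quantity[: -len(suffix)]
            return round(float(number) * _MEM_SUFFIXES[suffix])
    return int(quantity)


def cluster_headroom(nodes, pods):
    """Return free CPU/memory, absolute and as a fraction of allocatable.

    nodes: list of allocatable dicts, e.g. {"cpu": "4", "memory": "16Gi"}
    pods:  list of request dicts, e.g. {"cpu": "500m", "memory": "1Gi"};
           a missing key is treated as a request of zero.
    """
    alloc_cpu = sum(parse_cpu_millis(n["cpu"]) for n in nodes)
    alloc_mem = sum(parse_memory_bytes(n["memory"]) for n in nodes)
    req_cpu = sum(parse_cpu_millis(p.get("cpu", "0")) for p in pods)
    req_mem = sum(parse_memory_bytes(p.get("memory", "0")) for p in pods)
    return {
        "cpu_free_millis": alloc_cpu - req_cpu,
        "cpu_free_fraction": (alloc_cpu - req_cpu) / alloc_cpu if alloc_cpu else 0.0,
        "memory_free_bytes": alloc_mem - req_mem,
        "memory_free_fraction": (alloc_mem - req_mem) / alloc_mem if alloc_mem else 0.0,
    }
```

Note that this measures headroom against requests, which is what the scheduler cares about, rather than against actual usage. A low free fraction would suggest scaling up, and a consistently high one would suggest scaling down.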