Epic: Production Metrics for Results

Open adambkaplan opened this issue 2 years ago • 7 comments

Feature Request

Add/expose Prometheus metrics that are useful for monitoring the Results apiserver and watchers.

This issue is intended to be an "epic" with linked sub-issues for specific metrics that Results should expose.

Notes

Originally posted by @adambkaplan in https://github.com/tektoncd/results/pull/294#discussion_r1071701477

adambkaplan, Jan 18 '23 22:01

A few suggestions for metrics that I think would be relevant:

API server

  • Total number of errors encountered while processing requests.
  • Time taken to process each request.
  • GORM-related metrics (e.g. how long each query takes to complete; maybe we could group the metric by gRPC operation).

Note: I am not familiar with the gRPC ecosystem, but I think that many of those metrics are already exposed out of the box. So, we could confirm that and consider what else we need to instrument.

Watcher

  • Error rate.
  • How long requests made to the gRPC server take to return.
  • How long the reconciliation loop is taking.
  • Total number of deleted objects.
  • Work queue metrics (e.g. lag).

Knative already exposes a few metrics about controllers. So, we could confirm if they're already in place and what else we need to instrument.

alan-ghelardi, Feb 16 '23 17:02

/assign enarha

enarha, Mar 08 '23 13:03

I just started looking more seriously into this. One easy way to see what we currently export through the gRPC middleware is to port-forward port 9090 of the tekton-results-api pod and query it with curl, e.g. curl 127.0.0.1:9090/metrics. The output consists mostly of cumulative totals, which are not very helpful. The watcher also exports some metrics out of the box. I'll continue digging.

enarha, Mar 08 '23 13:03

/area roadmap

vdemeester, Mar 08 '23 13:03

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot, Jun 06 '23 14:06

/remove-lifecycle stale

enarha, Jun 07 '23 09:06

/lifecycle frozen

This is a critical feature set.

adambkaplan, Jun 16 '23 15:06