Grafana Operator Internal Dashboard

Open 1naboki1 opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe.

No

Describe the solution you'd like

I searched for a public dashboard for the grafana operator's internal metrics but was not able to find one.

Does one already exist?

Describe alternatives you've considered

Building one (when there is some spare time)

1naboki1 • Dec 15 '21 08:12

We haven't built any dashboard for the grafana-operator, but we would love PRs in this area :)

NissesSenap • Dec 16 '21 07:12

I was playing around with potential dashboards here and came up with the following.

I imported this dashboard for controller-runtime metrics: https://grafana.com/grafana/dashboards/12122

It didn't show anything initially, but after changing the 'instance' variable and the queries to use the service label instead (label_values(controller_runtime_active_workers, service)), it might give a good starting point. Note that there's nothing grafana operator specific in the dashboard.
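For anyone wanting to script that change, here is a minimal sketch in Python. It assumes the exported dashboard JSON uses the common `templating.list` and `panels[].targets[].expr` layout; the function name `use_service_label` is hypothetical, and panels nested inside collapsed rows are not handled.

```python
import json
import re


def use_service_label(dashboard: dict) -> dict:
    """Return a copy of the dashboard that selects on `service` instead of `instance`."""
    out = json.loads(json.dumps(dashboard))  # deep copy via a JSON round trip

    # Point the existing 'instance' template variable at the service label.
    for var in out.get("templating", {}).get("list", []):
        if var.get("name") == "instance":
            var["query"] = "label_values(controller_runtime_active_workers, service)"

    # Rewrite label matchers such as instance="..." or instance=~"..." in panel queries.
    # The "$instance" variable reference itself is left untouched.
    matcher = re.compile(r'\binstance(\s*=~?\s*")')
    for panel in out.get("panels", []):
        for target in panel.get("targets", []):
            if "expr" in target:
                target["expr"] = matcher.sub(r"service\1", target["expr"])
    return out
```

The variable keeps its original name, so only the label it is populated from and matched against changes.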

image

It took quite a bit of setup to get to this point from zero, so there are probably some README updates I can make covering how to give Prometheus permission to scrape grafana operator metrics (via the rbac proxy), how to set up the data source, and an example ServiceMonitor.
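As a rough idea of what such an example ServiceMonitor could look like, here is a sketch that builds the manifest as a Python dict and prints it as JSON (which `kubectl apply -f -` accepts). The resource name, namespace, selector labels, and port name are all hypothetical placeholders, and the https scheme with a bearer token is an assumption about how the rbac proxy is configured, so check them against the actual metrics Service.

```python
import json


def service_monitor(name: str, namespace: str, match_labels: dict, port: str) -> dict:
    """Build a Prometheus Operator ServiceMonitor manifest for an rbac-proxied endpoint."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Must match the labels on the operator's metrics Service.
            "selector": {"matchLabels": match_labels},
            "endpoints": [
                {
                    "port": port,  # name of the Service port, not a number
                    "path": "/metrics",
                    "scheme": "https",
                    # Assumes the proxy authenticates Prometheus by its service account token.
                    "bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token",
                    # Assumes a self-signed certificate on the proxy.
                    "tlsConfig": {"insecureSkipVerify": True},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = service_monitor(
        name="grafana-operator-metrics",          # hypothetical
        namespace="grafana",                      # hypothetical
        match_labels={"app": "grafana-operator"}, # hypothetical
        port="https",                             # hypothetical
    )
    print(json.dumps(manifest, indent=2))
```

The Prometheus service account would additionally need RBAC permission to reach the proxied metrics endpoint, which this sketch does not cover.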

Will wait for feedback before creating a PR.

david-martin • Jan 19 '22 13:01

I really like this approach. Did you test this with multiple grafana operators in the same cluster?

In my use case I provide several grafana instances, each with its own operator (unfortunately the Operator has no support for multiple instances).

Since I now have a bit more time at work, I will try to take this approach and combine it with the grafana metrics board.

1naboki1 • Jan 27 '22 16:01

did you test this with multiple grafana operators in the same cluster

No, I haven't. That might require another template variable for namespace.
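A sketch of what that extra variable could look like, again in Python against the dashboard JSON. It assumes the scraped metrics carry a `namespace` label (that depends on the scrape configuration) and that the dashboard has the 'instance' variable described earlier; `add_namespace_variable` is a hypothetical helper name.

```python
import json


def add_namespace_variable(dashboard: dict,
                           metric: str = "controller_runtime_active_workers") -> dict:
    """Return a copy of the dashboard with a 'namespace' variable placed before 'instance'."""
    out = json.loads(json.dumps(dashboard))  # deep copy via a JSON round trip
    variables = out.setdefault("templating", {}).setdefault("list", [])

    # Add the namespace variable once, ahead of the variables that depend on it.
    if not any(v.get("name") == "namespace" for v in variables):
        variables.insert(0, {
            "name": "namespace",
            "type": "query",
            "query": f"label_values({metric}, namespace)",
            "refresh": 1,  # re-query when the dashboard loads
        })

    # Chain the 'instance' variable so it only lists services in the chosen namespace.
    for var in variables:
        if var.get("name") == "instance":
            var["query"] = f'label_values({metric}{{namespace="$namespace"}}, service)'
    return out
```

The panel queries would also need a `namespace="$namespace"` matcher for the selection to take effect, which is left out here.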

Do follow up with how your approach goes.

david-martin • Feb 21 '22 10:02

Hey @1naboki1, did you get some time to combine this with the grafana metrics board?

david-martin • Apr 05 '22 11:04

We are currently working on a dashboard for the Operator. We would like to define some goals and requirements that are useful for assessing the operator state.

Goals:

  1. Assess the health of the Operator (not the Operand):
     1.1. Figure out if other processes (or users) are interfering with the Operator (this could cause resources to constantly change and restore).
     1.2. Memory and CPU usage over time.
     1.3. Number of garbage collections happening. If this number goes up, it could indicate that the Operator is struggling for memory.
     1.4. Network traffic from the Operator to Grafana and failed requests. This could indicate that Grafana itself is not healthy.

  2. Overview of managed resources:
     2.1. Get an overview of all Operands (how many Grafana instances, dashboards, data sources and folders is the operator managing and where are they located). This might require new metrics or additional labels in existing metrics.
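To make goals 1.1 to 1.3 more concrete, here is a sketch of candidate PromQL expressions expressed as Grafana panel targets. It assumes the Operator exposes the standard controller-runtime, Go runtime, and process metrics and that they carry a `service` label matched by an `$instance` dashboard variable; none of this has been verified against the Operator itself. Goals 1.4 and 2.1 are left out because they likely need new metrics.

```python
# Candidate queries per goal; the metric names are the standard Prometheus Go client
# and controller-runtime names, assumed (not confirmed) to be exposed by the Operator.
CANDIDATE_QUERIES = {
    "1.1 reconciles per controller": (
        'sum by (controller) '
        '(rate(controller_runtime_reconcile_total{service="$instance"}[5m]))'
    ),
    "1.2 resident memory": 'process_resident_memory_bytes{service="$instance"}',
    "1.2 CPU usage": 'rate(process_cpu_seconds_total{service="$instance"}[5m])',
    "1.3 garbage collections": 'rate(go_gc_duration_seconds_count{service="$instance"}[5m])',
}


def as_targets(queries: dict) -> list:
    """Turn a title -> PromQL mapping into Grafana panel targets with refIds A, B, C, ..."""
    return [
        {"refId": chr(ord("A") + i), "expr": expr, "legendFormat": title}
        for i, (title, expr) in enumerate(queries.items())
    ]
```

A persistently high reconcile rate for one controller would be a hint that something else keeps modifying its resources, which is the interference case in 1.1.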

Does that sound reasonable?

ping @ThirumlaDevi

pb82 • Jul 26 '23 11:07