[feature] Omni metrics on-premises

Open rothgar opened this issue 11 months ago • 7 comments

Problem Description

Customers need a way to gather metrics from their Omni instances and clusters. These metrics could include cluster, node, and provider counts and health.

They should also include the status of links (SideroLink) and other data stored in Omni.

Solution

We should provide a metrics endpoint that customers can scrape with a tool such as Prometheus.

Alternative Solutions

An alternative could be to add monitoring options for existing services (e.g., Datadog, Zabbix), but this may be harder because those agents would need to run as sidecars alongside the Omni instance.

Notes

No response

rothgar avatar Jan 14 '25 18:01 rothgar

Spitballing a few things below that seem like they'd be easy-ish to add and that customers may care about. Mostly just stuff we already have in our Omni Overview dashboard.

  • Siderolink up/down on machines
  • Number of machines connected to Omni
  • Clusters ready/not ready count
  • A count of clusters on version x.y.z of Talos and/or Kubernetes
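
As a rough illustration of what these could look like once exposed, here is a minimal sketch that renders the four items above in the Prometheus text exposition format. Everything named here is an assumption for illustration: the `Machine` and `Cluster` types, the `render_metrics` function, and the metric names (`omni_machines_connected`, `omni_siderolink_up`, `omni_clusters`, `omni_cluster_versions`) are hypothetical and not part of Omni. It also assumes a machine counts as "connected" when its SideroLink is up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Machine:  # hypothetical shape, not an Omni type
    name: str
    siderolink_up: bool


@dataclass
class Cluster:  # hypothetical shape, not an Omni type
    name: str
    ready: bool
    talos_version: str
    kubernetes_version: str


def escape_label(value):
    """Escape a label value as required by the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(machines, clusters):
    """Render gauges for the listed items; all metric names are hypothetical."""
    lines = []

    # Assumption: "connected" means the machine's SideroLink is up.
    lines.append("# TYPE omni_machines_connected gauge")
    connected = sum(1 for m in machines if m.siderolink_up)
    lines.append(f"omni_machines_connected {connected}")

    # One series per machine: 1 when the link is up, 0 when it is down.
    lines.append("# TYPE omni_siderolink_up gauge")
    for m in machines:
        name = escape_label(m.name)
        lines.append(f'omni_siderolink_up{{machine="{name}"}} {int(m.siderolink_up)}')

    # Ready and not-ready cluster counts.
    lines.append("# TYPE omni_clusters gauge")
    ready = sum(1 for c in clusters if c.ready)
    lines.append(f'omni_clusters{{ready="true"}} {ready}')
    lines.append(f'omni_clusters{{ready="false"}} {len(clusters) - ready}')

    # Cluster count per (Talos version, Kubernetes version) pair.
    lines.append("# TYPE omni_cluster_versions gauge")
    versions = Counter((c.talos_version, c.kubernetes_version) for c in clusters)
    for (talos, k8s), count in sorted(versions.items()):
        labels = f'talos="{escape_label(talos)}",kubernetes="{escape_label(k8s)}"'
        lines.append(f"omni_cluster_versions{{{labels}}} {count}")

    return "\n".join(lines) + "\n"
```

Keeping the versions as labels on a single gauge means a dashboard can group or filter by version without a new metric being added for every release.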

rsmitty avatar Jan 14 '25 18:01 rsmitty

image

Basically these seem useful to expose to customers ^

rsmitty avatar Jan 14 '25 18:01 rsmitty

We could expose the Prometheus metrics endpoint to the outside world by authenticating requests, so that clients can use service accounts to scrape it.

We could probably add a new role, MetricsReader, which has fewer privileges than a Reader, or something like that.
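
A minimal sketch of how such a role check might look, under loudly stated assumptions: the role ordering, the `Role` enum, and the `authorize_scrape` helper are all hypothetical, and a plain bearer token stands in for whatever mechanism service accounts actually use to authenticate requests.

```python
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical role ordering; a higher value implies more privileges."""
    NONE = 0
    METRICS_READER = 1  # the proposed new role, below Reader
    READER = 2
    OPERATOR = 3
    ADMIN = 4


def authorize_scrape(authorization_header, service_accounts):
    """Return an HTTP status code for a request to the metrics endpoint.

    service_accounts maps a token to a Role. This is a stand-in for real
    service account authentication, not how Omni does it.
    """
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return 401  # no credentials supplied
    role = service_accounts.get(authorization_header[len(prefix):])
    if role is None:
        return 401  # unknown service account
    # MetricsReader or any stronger role may scrape metrics.
    return 200 if role >= Role.METRICS_READER else 403


def can_read_resources(role):
    """MetricsReader must not gain the resource access that Reader has."""
    return role >= Role.READER
```

With this ordering, a service account holding only MetricsReader can scrape metrics but cannot read other resources, which is the point of placing it below Reader.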

utkuozdemir avatar Jan 15 '25 10:01 utkuozdemir

It would be great if metrics could be exported to Datadog. All our other dashboards, monitoring, and alerting are in Datadog, so we could monitor edge clusters from the same monitoring solution, set up alerting, etc.

diversit avatar Jan 30 '25 14:01 diversit

Are you running the datadog agent on the edge nodes?

I think Datadog would have to support scraping an Omni endpoint or you'd have to create a service account for that to work. It's been a while since I've used it and I'm not sure if that's something you can do in the agent.

rothgar avatar Jan 30 '25 23:01 rothgar

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 30 '25 02:07 github-actions[bot]

This is still something we should implement and document.

rothgar avatar Jul 30 '25 20:07 rothgar