[feature] Omni metrics on-premises
Problem Description
Customers need a way to gather metrics from their Omni instances and clusters. These metrics can include things like cluster, node, and provider counts and health.
They should also include the status of links (SideroLink) and other data stored in Omni.
Solution
We should provide a metrics endpoint that customers can scrape with something like Prometheus.
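To make the idea concrete, here is a minimal sketch of a scrapeable endpoint using only the Python standard library. It is an illustration, not how Omni would implement it: the metric name `omni_machines_connected`, the hardcoded value, and the `make_server` helper are all hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric and value; a real endpoint would read live state from Omni.
METRICS_BODY = (
    "# HELP omni_machines_connected Machines currently connected to Omni.\n"
    "# TYPE omni_machines_connected gauge\n"
    "omni_machines_connected 42\n"
)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the conventional /metrics path is served.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        # Content type used by the Prometheus text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; no per-request logging


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling `make_server(9090).serve_forever()` (port chosen arbitrarily) would expose the endpoint for a Prometheus scrape job to target.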
Alternative Solutions
An alternative could be to add monitoring options for existing services (e.g., Datadog, Zabbix), but this may be harder because those agents would need to run as sidecars alongside the Omni instance.
Notes
No response
Spitballing some things below that seem like they'd be easy-ish to add and that customers may care about. Mostly just stuff we already have in our Omni Overview dashboard.
- Siderolink up/down on machines
- Number of machines connected to Omni
- Clusters ready/not ready count
- A count of clusters on version x.y.z of Talos and/or k8s
Basically these seem useful to expose to customers ^
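As a rough sketch of how the items above could map onto Prometheus gauges, the function below renders a snapshot into the text exposition format. All metric names, label names, and the input dict shapes are made up for illustration; none of them come from Omni.

```python
from collections import Counter


def render_metrics(machines, clusters):
    """Render a snapshot as Prometheus text exposition format.

    machines: list of dicts with "id" (str) and "connected" (bool)
    clusters: list of dicts with "ready" (bool) and "talos_version" (str)
    Label values are assumed not to need escaping in this sketch.
    """
    connected = sum(1 for m in machines if m["connected"])
    ready = sum(1 for c in clusters if c["ready"])
    versions = Counter(c["talos_version"] for c in clusters)

    lines = ["# TYPE omni_machine_siderolink_up gauge"]
    # One series per machine: 1 if the SideroLink connection is up, else 0.
    for m in machines:
        up = 1 if m["connected"] else 0
        lines.append(f'omni_machine_siderolink_up{{machine="{m["id"]}"}} {up}')

    lines += [
        "# TYPE omni_machines_connected gauge",
        f"omni_machines_connected {connected}",
        "# TYPE omni_clusters gauge",
        f'omni_clusters{{ready="true"}} {ready}',
        f'omni_clusters{{ready="false"}} {len(clusters) - ready}',
        "# TYPE omni_clusters_by_talos_version gauge",
    ]
    # One series per Talos version, sorted for stable output.
    for version, count in sorted(versions.items()):
        lines.append(
            f'omni_clusters_by_talos_version{{version="{version}"}} {count}'
        )
    return "\n".join(lines) + "\n"
```

A Kubernetes version gauge would follow the same pattern as the Talos one.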
We could expose the Prometheus metrics endpoint to the outside world by authenticating requests, so that clients can use service accounts to scrape it.
We could probably add a new role, MetricsReader, which has fewer privileges than a Reader, or something like that.
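A sketch of what "fewer privileges than a Reader" could mean in practice, assuming roles form a simple ordered hierarchy. The enum, its ordering, and both check functions are hypothetical; MetricsReader is only the role proposed here, and how Omni would actually authenticate the service account is left out.

```python
from enum import IntEnum


class Role(IntEnum):
    # Assumption: a higher value implies all privileges of the lower ones.
    NONE = 0
    METRICS_READER = 1  # proposed role, placed just below READER
    READER = 2
    OPERATOR = 3
    ADMIN = 4


def can_scrape_metrics(role: Role) -> bool:
    """Any role at or above the proposed MetricsReader may scrape metrics."""
    return role >= Role.METRICS_READER


def can_read_resources(role: Role) -> bool:
    """Reading other Omni resources still requires at least Reader."""
    return role >= Role.READER
```

With this ordering, a service account holding only `Role.METRICS_READER` could scrape the endpoint but could not read anything else.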
It would be great if metrics could be exported to Datadog. All our other dashboards, monitoring, and alerting are in Datadog, so we could monitor edge clusters from the same monitoring solution, set up alerting there, etc.
Are you running the Datadog agent on the edge nodes?
I think Datadog would have to support scraping an Omni endpoint or you'd have to create a service account for that to work. It's been a while since I've used it and I'm not sure if that's something you can do in the agent.
This is still something we should implement and document.