Add cluster activation status metric and emit to v1/jmx
Description
As discussed in Issue #649 we would like to add a built-in activation status metric to Trino Gateway, to improve telemetry. This metric has a value of 1 if activated and a value of 0 if deactivated. The metric is populated with values polled directly from the database. Each cluster has it's own BackendClusterMetricsStats that can keep track of multiple metrics as we add in the future. When a backend cluster is added/deleted, an associated metric is registered/unregistered.
To keep this information up to date, the backends table in the DB is polled every 30 seconds to check if any changes have been made. We can't rely on triggering a metrics registry refresh within the add/delete backend methods because only one instance gets this API call. So if there are multiple instances, they won't all know to update their respective list of metrics. NOTE: Because these are periodically updated every 30 seconds, there may be a slight delay in new metrics showing up and old metrics disappearing. For deleted backends, their activation status metric value will be -1 until it is unregistered next refresh period.
Testing
-
built and ran locally
-
saw new metrics in /v1/jmx (NOTE: added testMetric to show proof of concept of adding other types of metrics, but removed from code before publishing PR)
-
deactivated and activated cluster via UI and curl calls, and saw changes reflected in metrics after refreshing v1/jmx
-
added, updated, and deleted backends, and saw associated metrics added, updated and deleted respectively (with expected slight delay due to refresh period)
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required. ( x ) Release notes are required. Please propose a release note for me. ( ) Release notes are required, with the following suggested text:
* Fix some things. ({issue}`issuenumber`)
LGTM!
pinging @vishalya
LGTM me as well. This exporter has it's own executor/thread which exports the metrics, would like to get an opinion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.
LGTM me as well. This exporter has it's own executor/thread which exports the metrics, would like to get an onion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.
One thing to point out here -- the executor service isn't actually used to export the metrics. It's used to refresh the list of backends which are registered as having metrics (MBeanExporter#exportWithGeneratedName()/MBeanExporter#unexportWithGeneratedName()). Then the normal MBeanExporter thread handles the actual exporting of metrics.
For a little background on why this is needed, we started with just updating the map on any changes to the backends (add/delete/etc), but of course with a multi-Gateway setup this falls out of sync when changes are made on other instances. It's essentially the same problem we've discussed in other contexts; I think I can remember specifically that we have this split-brain with the health check logic.
Long-term, probably the right place to be in is that we have one central "state sync" which propagates any DB state changes back into the Gateway's local in-memory state, and any kind of triggers involved with state changes (including metrics registration) could happen there.
@ebyhr - does this look good to you? I am fine merging it.
ebyhr - does this look good to you? I am fine merging it.
Don't forget to squash commits 😁
@ebyhr @vishalya @Chaho12 @amybubu - can you please collaborate to figure out next step here
Thank you @ebyhr @amybubu @Chaho12 and @vishalya