trino-gateway icon indicating copy to clipboard operation
trino-gateway copied to clipboard

Add cluster activation status metric and emit to v1/jmx

Open amybubu opened this issue 1 year ago • 4 comments

Description

As discussed in Issue #649 we would like to add a built-in activation status metric to Trino Gateway, to improve telemetry. This metric has a value of 1 if activated and a value of 0 if deactivated. The metric is populated with values polled directly from the database. Each cluster has it's own BackendClusterMetricsStats that can keep track of multiple metrics as we add in the future. When a backend cluster is added/deleted, an associated metric is registered/unregistered.

To keep this information up to date, the backends table in the DB is polled every 30 seconds to check if any changes have been made. We can't rely on triggering a metrics registry refresh within the add/delete backend methods because only one instance gets this API call. So if there are multiple instances, they won't all know to update their respective list of metrics. NOTE: Because these are periodically updated every 30 seconds, there may be a slight delay in new metrics showing up and old metrics disappearing. For deleted backends, their activation status metric value will be -1 until it is unregistered next refresh period.

Testing

  • built and ran locally

  • saw new metrics in /v1/jmx (NOTE: added testMetric to show proof of concept of adding other types of metrics, but removed from code before publishing PR) Screenshot 2025-04-07 at 5 38 33 PM

  • deactivated and activated cluster via UI and curl calls, and saw changes reflected in metrics after refreshing v1/jmx

  • added, updated, and deleted backends, and saw associated metrics added, updated and deleted respectively (with expected slight delay due to refresh period)

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required. ( x ) Release notes are required. Please propose a release note for me. ( ) Release notes are required, with the following suggested text:

* Fix some things. ({issue}`issuenumber`)

amybubu avatar Apr 15 '25 20:04 amybubu

LGTM!

andythsu avatar Apr 24 '25 00:04 andythsu

pinging @vishalya

andythsu avatar Apr 24 '25 18:04 andythsu

LGTM me as well. This exporter has it's own executor/thread which exports the metrics, would like to get an opinion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.

vishalya avatar Apr 30 '25 13:04 vishalya

LGTM me as well. This exporter has it's own executor/thread which exports the metrics, would like to get an onion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.

One thing to point out here -- the executor service isn't actually used to export the metrics. It's used to refresh the list of backends which are registered as having metrics (MBeanExporter#exportWithGeneratedName()/MBeanExporter#unexportWithGeneratedName()). Then the normal MBeanExporter thread handles the actual exporting of metrics.

For a little background on why this is needed, we started with just updating the map on any changes to the backends (add/delete/etc), but of course with a multi-Gateway setup this falls out of sync when changes are made on other instances. It's essentially the same problem we've discussed in other contexts; I think I can remember specifically that we have this split-brain with the health check logic.

Long-term, probably the right place to be in is that we have one central "state sync" which propagates any DB state changes back into the Gateway's local in-memory state, and any kind of triggers involved with state changes (including metrics registration) could happen there.

xkrogen avatar Apr 30 '25 14:04 xkrogen

@ebyhr - does this look good to you? I am fine merging it.

vishalya avatar Jun 09 '25 18:06 vishalya

ebyhr - does this look good to you? I am fine merging it.

Don't forget to squash commits 😁

Chaho12 avatar Jun 09 '25 23:06 Chaho12

@ebyhr @vishalya @Chaho12 @amybubu - can you please collaborate to figure out next step here

mosabua avatar Jun 18 '25 05:06 mosabua

Thank you @ebyhr @amybubu @Chaho12 and @vishalya

mosabua avatar Jun 19 '25 03:06 mosabua