Proposal to improve telemetry of Trino Gateway by tracking cluster activation status

Open amybubu opened this issue 1 year ago • 5 comments

Problem

Currently, there is no built-in metric within the Trino Gateway that consistently tracks the activation status of the Trino clusters behind it. This means we have no historical record of which state each cluster is in, or has been in, over time. Without this visibility, prolonged outages may go unnoticed, leading to delayed mitigation and reduced availability. An activation status metric on the Gateway would provide crucial insight into cluster behavior, enhancing both auditing and large-scale cluster management.

Motivation

As a backend-level metric, this data would address a gap in the open-source Trino Gateway. The OSS metrics currently emitted by airlift primarily cover JVM and request-level information. An activation status metric would give higher visibility into our backend clusters through the gateway, without needing to make an API call or look at the database table directly.

At LinkedIn, we introduced an activation status metric into our Trino Gateway service and it’s become an essential tool for maintaining availability at scale. By integrating this metric into our alerting system, we receive timely notifications whenever a cluster remains deactivated for an unexpected amount of time. This has proved very beneficial to us, as we operate Trino at a large scale, serving thousands of weekly active users who execute over 5.5 million queries per week. Serving this amount of traffic requires operating 10+ clusters behind Trino Gateway. Additionally, LinkedIn’s fleet management infrastructure automatically deactivates and reactivates our clusters during deployment and maintenance procedures to maintain high availability, which means we are not always aware when a cluster is being taken down. The alert has drawn our attention to failed automatic procedures that unintentionally left clusters in a deactivated state, reducing time to mitigation from potentially days to just hours.

The metric's data has come in handy when investigating failing clusters, uncovering bugs, and enabling faster rollbacks. It has even simply reminded us to bring clusters back up manually when we forgot to re-activate them after mitigation. Whether the cause is human error or an infrastructure glitch, this metric has helped us catch the issue early, which is crucial because a capacity shortage means more queries have to be queued on the remaining clusters, degrading the user experience. We’ve also combined this metric with other alerts to improve operational excellence, and added logs that capture activation/deactivation calls, providing a clear paper trail for historical analysis and troubleshooting. We believe this activation status metric would benefit the entire Trino Gateway community, enabling more telemetry, better cluster management, faster triage, and improved auditing. We propose making this metric a standard feature in Trino Gateway, helping other users achieve the same reliability and operational improvements we’ve experienced.
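
To make the idea concrete, here is a minimal sketch of the alerting logic described above: a 1/0 activation gauge per cluster, plus a check for clusters that have stayed deactivated longer than a threshold. It is an illustration only, not LinkedIn's implementation and not Trino Gateway code. The `ActivationTracker` class, its method names, the cluster names, and the two-hour threshold are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ActivationTracker:
    """Hypothetical tracker: a 1/0 gauge per cluster plus the time of the last change."""

    # cluster name -> (is_active, timestamp of the last status change)
    _state: dict[str, tuple[bool, datetime]] = field(default_factory=dict)

    def record(self, cluster: str, active: bool, at: datetime) -> None:
        """Store a status sample; the timestamp only moves when the status flips."""
        previous = self._state.get(cluster)
        if previous is None or previous[0] != active:
            self._state[cluster] = (active, at)

    def gauge(self, cluster: str) -> int:
        """Return 1 for an active cluster and 0 for a deactivated one."""
        return 1 if self._state[cluster][0] else 0

    def stale_deactivations(self, now: datetime, threshold: timedelta) -> list[str]:
        """List clusters that have been deactivated for longer than the threshold."""
        return sorted(
            name
            for name, (active, since) in self._state.items()
            if not active and now - since > threshold
        )


# Illustrative data only: cluster names and times are made up.
start = datetime(2025, 3, 19, 12, 0, tzinfo=timezone.utc)
tracker = ActivationTracker()
tracker.record("cluster-a", True, start)
tracker.record("cluster-b", False, start)
# A repeated sample with the same status does not reset the clock.
tracker.record("cluster-b", False, start + timedelta(hours=1))
tracker.record("cluster-c", False, start + timedelta(hours=3))

alerts = tracker.stale_deactivations(
    now=start + timedelta(hours=4), threshold=timedelta(hours=2)
)
print(alerts)  # ['cluster-b']
```

Only `cluster-b` is reported: it has been deactivated for four hours, while `cluster-c` has been down for one hour and `cluster-a` is active. In practice the gauge would be emitted by the gateway and the threshold check would live in whatever alerting system consumes it.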

amybubu avatar Mar 19 '25 17:03 amybubu

I think this sounds like a good idea. It would be good to have this in the database and also include visibility for the data in the UI ideally. But even just keeping it in the backend db would already be useful. Over time we might have to develop processes or tools to manage the amount of historical data.

We would happily work with you on reviewing a PR for this feature.

mosabua avatar Mar 22 '25 01:03 mosabua

@mosabua Thank you for the reply and glad to hear you think it's a good idea! Could you just clarify what you mean about having this data in the backend DB vs UI? Do you mean we should have a table where we add a row every time activate/deactivate is called?

amybubu avatar Mar 24 '25 17:03 amybubu

Well.. I would assume we want to keep track of the data somewhere so we can visualize it .. where did you propose to record the status changes so that you can do that analysis .. or did you just want to have the events somehow emitted and then leave tracking to some other tool .. maybe via opentelemetry or whatever?

mosabua avatar Apr 03 '25 03:04 mosabua

Either way .. definitely a useful thing to add.

mosabua avatar Apr 03 '25 03:04 mosabua

When I wrote the proposal, I was thinking that we would emit the metric and then use another tool for tracking! Definitely considering using opentelemetry.

amybubu avatar Apr 07 '25 17:04 amybubu

Can we close this now that two related PRs are merged or do you have additional improvements in mind @amybubu ?

mosabua avatar Jun 19 '25 05:06 mosabua

I think we're good to close! Thanks for your help

amybubu avatar Jun 20 '25 17:06 amybubu