[Proposal] Add gateway_settings table to enable real-time gateway toggles

Open felicity3786 opened this issue 6 months ago • 4 comments

Problem statement

Developers sometimes need to quickly change routing behavior in production without a deploy (e.g., temporarily ignore health checks to re-route traffic during an incident). Today, the gateway computes health and routability automatically; there is no supported way to override it in real time across all gateway instances.

Examples and practical use cases at LinkedIn:

  • Temporarily disable health-based routing decisions (global “kill switch”).
  • Adjust health thresholds (e.g., required healthy worker percentage) at runtime.
  • A path to support future feature toggles (e.g., choose routing manager implementation, rule-engine mode) without adding bespoke config flags or redeploying.

Goals

  • Provide a minimal, generic runtime settings table shared by all gateway instances.
  • Expose a small admin API and a simple UI toggle for on-call engineers.
  • Safe by default: reads are cached, writes are audited, and feature flags have clear defaults.
  • Backward compatible: if the Settings table is absent, gateway behaves exactly as today.
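
To make the "cached reads, clear defaults, backward compatible" goals concrete, here is a minimal sketch of one way they could fit together. It is not existing trino-gateway code: the class name CachedSettings, the injected loader, and the clock parameter are all assumptions made for illustration. The loader stands in for a query against the settings table; if it fails (for example because the table does not exist), lookups fall back to the built-in defaults.

```java
import java.time.Duration;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of trino-gateway.
final class CachedSettings {
    // Assumed to wrap something like: SELECT setting_key, setting_value FROM gateway_settings
    private final Supplier<Map<String, String>> loader;
    private final Map<String, String> defaults;
    private final long ttlNanos;
    private final LongSupplier nanoClock; // injectable so the cache can be tested without sleeping
    private Map<String, String> snapshot = Map.of();
    private long loadedAt;
    private boolean loaded;

    CachedSettings(Supplier<Map<String, String>> loader, Map<String, String> defaults,
                   Duration ttl, LongSupplier nanoClock) {
        this.loader = loader;
        this.defaults = Map.copyOf(defaults);
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    synchronized String get(String key) {
        long now = nanoClock.getAsLong();
        if (!loaded || now - loadedAt >= ttlNanos) {
            try {
                snapshot = Map.copyOf(loader.get());
            } catch (RuntimeException e) {
                // Table absent or database unreachable: keep the previous snapshot,
                // which is empty if nothing was ever loaded, so defaults apply.
            }
            loadedAt = now;
            loaded = true;
        }
        String value = snapshot.get(key);
        return value != null ? value : defaults.get(key);
    }
}
```

In production the clock would be System::nanoTime. With this shape a gateway without the table never sees anything but its defaults, which is the "behaves exactly as today" case.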

We can add a new gateway_settings table, which might look like this:

CREATE TABLE IF NOT EXISTS gateway_settings (
  setting_key   VARCHAR(128)  NOT NULL PRIMARY KEY,
  setting_value VARCHAR(1024) NOT NULL,
  scope_type    ENUM('GLOBAL','ROUTING_GROUP','CLUSTER') DEFAULT 'GLOBAL',
  scope_id      VARCHAR(128)  NULL,
  updated_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  updated_by    VARCHAR(128)  NOT NULL
);

And provide an admin API for reading and updating these settings, for example:

GET  /v1/admin/settings
    → 200 { "health.monitor.enabled":"true", "health.required.minHealthyWorkerPercent":"80" }

We would also add an associated UI feature to make these settings easy to change.

cc @xkrogen

felicity3786, Oct 17 '25 22:10

We need to consider how these interact with the normal YAML configs. The codebase generally assumes that the configs are immutable, so I don't think we can use this approach to change any arbitrary config. So how do we express/expose which values can be changed at runtime? For settings that allow runtime update, do we use the same name as the applicable config, or should there be a separate namespace for these (similar to configs vs session properties in Trino)? Do classes/modules have some way to "register" which settings keys they support so that we can perform validation on the key name on updates? Do classes/modules have to perform periodic fetch of the settings table, or do they get a callback to inform them of relevant updates?

I like the concept but feel we need to put a little more thought into where/how this will be applied

xkrogen, Oct 17 '25 22:10

I've got a few questions regarding your proposal.

  1. When you say "all gateway instances", do you mean multiple replicas with a proxy server in front? A single update in the active state (via DB or API) should be applied to all instances. We have 16 replicas handling proxying, and it works well when doing blue-green deployments.
  2. Are you introducing a minimum healthy worker percentage because you want to exclude a cluster from routing when some workers die and the cluster becomes unstable? We already have health checks by worker count, GC count, and other metrics. Please check the docs. cc @andythsu
  3. I don't mind storing the immutable configs you mentioned in the DB if there is a specific reason, but for now I don't quite see it. Does this change have to be made in real time? Changing the routing manager and rules engine would change everything about how the gateway works, which seems very dangerous.

Chaho12, Oct 19 '25 23:10

Good points, @xkrogen.

So how do we express/expose which values can be changed at runtime?

Yes, I think this should apply only to a defined set of keys, not to all configs. For precedence, the runtime (DB) value would override the YAML file. We can give these keys a runtime. or settings. prefix so they live under a dedicated section in the file.

Do classes/modules have some way to "register" which settings keys they support so that we can perform validation on the key name on updates? Do classes/modules have to perform periodic fetch of the settings table, or do they get a callback to inform them of relevant updates?

I hope we can make it work like the JMX module: have Guice handle the binding and register the settings (for validation and discovery), then have a central manager such as SettingsManager do the polling and push callbacks to subscribers.
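
A minimal sketch of the poll-then-callback idea, under stated assumptions: only the name SettingsManager comes from this thread, while subscribe, pollOnce, and the injected tableReader are hypothetical. Guice wiring and scheduling are left out; a single scheduled task would be expected to call pollOnce at a fixed interval.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch: one component reads the table, consumers only receive callbacks.
final class SettingsManager {
    private final Supplier<Map<String, String>> tableReader; // assumed to read gateway_settings
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();
    private Map<String, String> lastSeen = Map.of();

    SettingsManager(Supplier<Map<String, String>> tableReader) {
        this.tableReader = tableReader;
    }

    // A class registers interest in one key; it never queries the table itself.
    void subscribe(String key, Consumer<String> callback) {
        subscribers.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(callback);
    }

    // One poll cycle: notify subscribers only for keys whose value changed since the last cycle.
    synchronized void pollOnce() {
        Map<String, String> current = Map.copyOf(tableReader.get());
        subscribers.forEach((key, callbacks) -> {
            String newValue = current.get(key); // null means the row is absent
            if (!Objects.equals(newValue, lastSeen.get(key))) {
                callbacks.forEach(callback -> callback.accept(newValue));
            }
        });
        lastSeen = current;
    }
}
```

One consequence of this shape is that a subscriber registered after a value was first seen is not notified until the value changes again, so consumers would still need a default to start from.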

felicity3786, Oct 21 '25 18:10

I've got a few questions regarding your proposal.

  1. When you say "all gateway instances", do you mean multiple replicas with a proxy server in front? A single update in the active state (via DB or API) should be applied to all instances. We have 16 replicas handling proxying, and it works well when doing blue-green deployments.
  2. Are you introducing a minimum healthy worker percentage because you want to exclude a cluster from routing when some workers die and the cluster becomes unstable? We already have health checks by worker count, GC count, and other metrics. Please check the docs. cc @andythsu
  3. I don't mind storing the immutable configs you mentioned in the DB if there is a specific reason, but for now I don't quite see it. Does this change have to be made in real time? Changing the routing manager and rules engine would change everything about how the gateway works, which seems very dangerous.

Thanks, @Chaho12.

  1. Yes, that's what I meant. We have multiple replicas, so we want the DB to be the source of truth (SoT) that all replicas see.
  2. Sorry this wasn't clear. We use a similar metrics-based health check, but in practice we sometimes want to adjust the threshold quickly. For example, during some SEV incidents we have seen clusters lose workers and become unroutable. A contingency mechanism to dynamically update the threshold, or to override the worker-count-based health check, would be very helpful.
  3. Totally fair concern. I feel we can start by scoping runtime settings to operational overrides only (e.g., runtime.health.enabled, runtime.health.minHealthyWorkerPercent). Structural configuration such as the routing manager or rules engine remains immutable at runtime and stays in YAML (or, optionally, can be mirrored to the DB as read-only/pending values that only take effect on restart). A typed registry whitelists which keys are mutable; writes to unknown keys are rejected. A central SettingsManager does the polling and pushes callbacks to consumers, so individual classes don't poll. Hard safety rails (e.g., zero-worker exclusion) always apply.
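
As an illustration of the typed registry and the safety rail in point 3, here is a rough sketch. Only the two runtime.health.* key names come from this thread; RuntimeSettingsRegistry, validate, isRoutable, and the exact threshold comparison are assumptions, not existing gateway code.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a whitelist of runtime-mutable keys with typed validation.
final class RuntimeSettingsRegistry {
    // Each parser doubles as a validator: it throws IllegalArgumentException on bad input.
    private static final Map<String, Function<String, Object>> MUTABLE_KEYS = Map.of(
            "runtime.health.enabled", RuntimeSettingsRegistry::parseBoolean,
            "runtime.health.minHealthyWorkerPercent", RuntimeSettingsRegistry::parsePercent);

    private RuntimeSettingsRegistry() {}

    // Called on every write; keys outside the whitelist are rejected.
    static Object validate(String key, String rawValue) {
        Function<String, Object> parser = MUTABLE_KEYS.get(key);
        if (parser == null) {
            throw new IllegalArgumentException("Not a runtime-mutable setting: " + key);
        }
        return parser.apply(rawValue);
    }

    private static Boolean parseBoolean(String raw) {
        if (raw.equals("true")) return Boolean.TRUE;
        if (raw.equals("false")) return Boolean.FALSE;
        throw new IllegalArgumentException("Expected true or false, got: " + raw);
    }

    private static Integer parsePercent(String raw) {
        int value = Integer.parseInt(raw.trim()); // NumberFormatException is an IllegalArgumentException
        if (value < 0 || value > 100) {
            throw new IllegalArgumentException("Percent out of range: " + value);
        }
        return value;
    }

    // The zero-worker rail is checked before any override, so no setting can bypass it.
    static boolean isRoutable(boolean healthChecksEnabled, int healthyWorkers,
                              int totalWorkers, int minHealthyPercent) {
        if (healthyWorkers <= 0) return false;
        if (!healthChecksEnabled) return true;
        return healthyWorkers * 100L >= (long) minHealthyPercent * totalWorkers;
    }
}
```

With this shape, a write to a structural key such as a routing manager setting fails validation before it reaches the DB, and disabling health checks still cannot route traffic to a cluster with zero healthy workers.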

felicity3786, Oct 21 '25 18:10