trino-gateway

Feature request - Retry query in another Trino cluster

Open Chaho12 opened this issue 1 year ago • 5 comments

An optional feature to support retrying on another backend URL (maybe the default one?) when a query is sent to the gateway but the backend it was routed to dies and/or that Trino server does not respond.

Chaho12 avatar Jan 25 '24 00:01 Chaho12

We have had issues where the status check on the backend response fails, which causes the backend to be set to inactive.

siminyou avatar Feb 13 '24 23:02 siminyou

Adding support for retrying (https://github.com/trinodb/trino-gateway/issues/268) sounds better. Let me close this issue.

ebyhr avatar Mar 14 '24 22:03 ebyhr

This is different from #268. #268 is about the gateway <-> database connection issue. This is about the gateway <-> Trino connection issue. I think this is a good-to-have feature.

oneonestar avatar Mar 18 '24 10:03 oneonestar

Yes, I meant the gateway <-> Trino connection issue: when a Trino cluster goes down (or hits an internal error?), an optional feature for the gateway to retry on another backend rather than returning a FAILED/timeout error to the user. Ultimately, it would be great if HA were 100% supported. This is a more complex feature than the simple routing that the Gateway does.

Chaho12 avatar Mar 18 '24 11:03 Chaho12
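
To make the request above concrete, here is a minimal sketch of "retry on another backend" in Python. It is not trino-gateway code: `submit_with_failover`, `BackendUnavailable`, and the `submit` callback are hypothetical names introduced only for illustration, and the sketch assumes the failure is detected before any results have been handed back to the client.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Row = Tuple  # a result row, represented as a plain tuple for this sketch


class BackendUnavailable(Exception):
    """Hypothetical error: the backend is down or did not respond."""


def submit_with_failover(
    sql: str,
    backends: Iterable[str],
    submit: Callable[[str, str], List[Row]],
) -> List[Row]:
    """Try each backend URL in order and return the first successful result.

    `submit(url, sql)` is an assumed callback that runs the query on one
    backend and raises BackendUnavailable if that backend cannot be reached.
    """
    last_error: Optional[Exception] = None
    for url in backends:
        try:
            return submit(url, sql)
        except BackendUnavailable as exc:
            # Remember the failure and fall through to the next backend.
            last_error = exc
    raise BackendUnavailable(f"all backends failed: {last_error}")
```

The callback keeps the retry loop independent of how the gateway actually talks to Trino. Retrying a query whose results have already started streaming to the client needs more than this loop, which is the complexity discussed in the next comment.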

Generally +1 for this feature, though it essentially requires re-implementing Trino FTE's QUERY-level retries inside of the Gateway, including all of the complexity that comes along with that, such as having to spool the results that have been returned to the client so far. This also implies that the GW must be used as a full proxy, streaming all results, rather than letting clients communicate with the coordinator directly to fetch results.

That being said, I think it is highly worthwhile. Adding all of the aforementioned infrastructure will also enable other features like dark canary-style deployments, where a copy of each query (or a sampling of them) is sent to a deployment which is in the process of being validated, and the GW can perform comparisons between the "prod" copy of the query and the "test" copy. This is a super powerful mechanism that can validate new deployments in a very robust manner (for both functional parity and performance).

xkrogen avatar May 28 '24 18:05 xkrogen
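
The dark-canary idea in the comment above can be sketched roughly as follows. This is an illustration only, not an existing trino-gateway feature: `run_with_shadow`, `compare_results`, `run_prod`, and `run_test` are hypothetical names, rows are assumed to be hashable tuples, and the comparison ignores row order because a query without `ORDER BY` does not guarantee one.

```python
import random
from collections import Counter
from typing import Callable, List, Optional, Tuple

Row = Tuple  # a result row, represented as a plain tuple for this sketch


def compare_results(prod_rows: List[Row], test_rows: List[Row]) -> bool:
    """Order-insensitive comparison: same rows with the same multiplicities."""
    return Counter(prod_rows) == Counter(test_rows)


def run_with_shadow(
    sql: str,
    run_prod: Callable[[str], List[Row]],
    run_test: Callable[[str], List[Row]],
    sample_rate: float,
    rng: random.Random,
) -> Tuple[List[Row], Optional[bool]]:
    """Run `sql` on prod; for a sampled fraction, also run it on the test
    deployment and report whether the results match.

    Returns (prod_rows, match), where match is None if the query was not
    sampled, and False if the test deployment failed or disagreed.
    """
    prod_rows = run_prod(sql)
    if rng.random() >= sample_rate:
        return prod_rows, None  # not sampled: the client sees prod only
    try:
        test_rows = run_test(sql)
    except Exception:
        # A failure on the deployment under validation must never affect
        # the answer the client gets from prod.
        return prod_rows, False
    return prod_rows, compare_results(prod_rows, test_rows)
```

The client always receives the prod result; the test copy is only used for the comparison. A real implementation would presumably restrict shadowing to read-only queries, since duplicating a write would apply it twice, and would also record timings to cover the performance side mentioned above.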