
Fix catalog registration race upon trino startup

Open grantatspothero opened this issue 2 years ago • 6 comments

Description

Previously, Trino did not wait until catalogs were registered before announcing that startup is complete. This led to a race in which queries could be submitted and then fail with "No nodes available to run query" because catalogs were not registered yet.

Work around this issue by explicitly waiting for catalogs to be registered before announcing that startup is complete.
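
To illustrate the idea (not the actual server code in this PR), here is a minimal toy model of gating the "startup complete" announcement on catalog registration. The `StartupGate` class, its method names, and the catalog names are all hypothetical and exist only for this sketch.

```python
import threading


class StartupGate:
    """Hypothetical model: startup counts as complete only after
    every expected catalog has been registered."""

    def __init__(self, expected_catalogs):
        self._pending = set(expected_catalogs)
        self._lock = threading.Lock()
        self._all_registered = threading.Event()
        if not self._pending:
            # Nothing to wait for, so startup can be announced right away.
            self._all_registered.set()

    def register(self, catalog):
        """Mark one catalog as registered; open the gate when none remain."""
        with self._lock:
            self._pending.discard(catalog)
            if not self._pending:
                self._all_registered.set()

    def await_catalogs(self, timeout_s=None):
        """Block until all catalogs are registered; False on timeout."""
        return self._all_registered.wait(timeout_s)

    def is_starting(self):
        """What a status endpoint would report while catalogs are pending."""
        return not self._all_registered.is_set()
```

In this model, the server would call `await_catalogs()` before flipping its status to "started", so a client that sees startup as complete can no longer observe a missing catalog.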

I found this issue during automated tests that launch trino clusters in docker, wait for the /v1/info endpoint to say that startup is complete, and then run queries.
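
For context, the readiness check in such a test harness might look roughly like the sketch below. It assumes the `/v1/info` response is JSON with a boolean `starting` field; the helper names, timeouts, and the base URL in the usage note are assumptions of this example, not part of the PR.

```python
import json
import time
import urllib.request


def fetch_info(base_url):
    """GET <base_url>/v1/info and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/info", timeout=5) as resp:
        return json.load(resp)


def wait_until_started(fetch, timeout_s=60.0, interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until it reports that startup is complete.

    Returns True if startup was reported complete before the timeout,
    False otherwise.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            info = fetch()
            # Assumption: the info document has a boolean "starting" field.
            if info.get("starting") is False:
                return True
        except OSError:
            # Server is not accepting connections yet; keep polling.
            pass
        sleep(interval_s)
    return False
```

A harness would call something like `wait_until_started(lambda: fetch_info("http://localhost:8080"))` (host and port assumed) and then submit queries. Without the fix described above, a `True` result here does not guarantee that catalogs are registered, which is exactly the race.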

Is this change a fix, improvement, new feature, refactoring, or other?

Fix.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Trino server startup process.

How would you describe this change to a non-technical end user or system administrator?

Fix transient error when running queries immediately after trino startup.

Related issues, pull requests, and links

N/A

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

grantatspothero, Jun 06 '22 18:06

cc @electrum @sopel39 @arhimondr

findepi, Jun 06 '22 18:06

@dain @losipiuk any chance we can get an initial pass at reviewing this relatively soon? From a look at git blame, it seems like you two are the most involved in these files, and this PR has been waiting for a while.

colebow, Aug 08 '22 22:08

@grantatspothero do we still want this PR? I thought you worked around it in another way

dain, Feb 19 '23 19:02

> @grantatspothero do we still want this PR? I thought you worked around it in another way

We discussed this on the Trino Slack.

For others: this race still occurs, but if one assumes that all nodes are homogeneous and run the same set of catalogs, then the fix is much simpler.

Going to keep this open since the race still occurs, but the better solution is probably to only support running Trino where all catalogs exist on all nodes.

grantatspothero, Feb 23 '23 20:02

@grantatspothero is this still in progress? I think the assumption that all catalogs exist on all workers should be valid. Any other input, @dain?

mosabua, Jan 15 '24 20:01