Trino for distributed queries
What is it?
In the future, we may have data spread across several places (e.g. BigQuery, ClickHouse, Postgres, random files, IPFS). Trino looks like an interesting option for running distributed queries across them: https://trino.io/
Looking at the docs, this is pretty interesting. You can set up data connectors to:
- Run queries over the BigQuery Storage API
- Forward queries to a ClickHouse or Snowflake instance
Then join it all together in a unified interface. This will be useful if our data really does end up spread across a bunch of locations.
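To make the "join it all together" idea a bit more concrete, here is a rough Python sketch. Trino addresses tables as catalog.schema.table, so a cross-source join is just a normal SQL join over fully qualified names. The catalog names (bigquery, clickhouse), the schemas, the tables, the project_id column, and the host/port/user defaults are all made up for illustration, not anything we have configured. The run_query helper assumes the trino Python client is installed (pip install trino) and a coordinator is reachable.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a Trino-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_join(left: str, right: str, key: str) -> str:
    """Build a SQL join across two fully qualified tables on a shared key column."""
    return (
        f"SELECT l.{key}, count(*) AS n\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


def run_query(sql: str, host: str = "localhost", port: int = 8080, user: str = "oso"):
    """Run the query through the trino Python client and return all rows.

    host, port, and user are placeholder values (assumptions).
    """
    # Imported lazily so the helpers above work without the client installed.
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


# Hypothetical catalogs and tables, one per data source.
events = qualified("bigquery", "analytics", "events")
projects = qualified("clickhouse", "default", "projects")
sql = build_federated_join(events, projects, "project_id")
print(sql)
```

Calling run_query(sql) would send the join to Trino, which reads from each connector and does the join itself. Nothing here has been tested against a real cluster.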
In case it helps, Starburst is probably the best managed offering!
Apparently you can run Trino on GCP Dataproc, which surprised me: https://cloud.google.com/dataproc/docs/tutorials/trino-dataproc
For reference, dbt-trino is useful if we want to replace BQ in our data pipeline https://github.com/starburstdata/dbt-trino