
Use BigQuery Read API for reading external BigLake tables

marcinsbd opened this pull request 1 year ago

Description

Continuation of the https://github.com/trinodb/trino/pull/21017

The BigQuery Storage API supports reading BigLake external tables (i.e., external tables with a connection). The current implementation, however, reads them through views, which can be expensive because it requires Trino to issue a SQL query against BigQuery. This PR adds support for reading BigLake tables directly using the Storage API.

There are no behavior changes for plain external tables and BigQuery native tables: they continue to use the view-based and Storage API paths, respectively. A new test covers BigLake tables.
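For context, here is a rough sketch of the read-path dispatch the description implies. It is illustrative only, not the actual connector code; all names are made up:

```java
// Illustrative sketch only; not the actual Trino connector code.
enum ReadPath
{
    STORAGE_API,          // direct, parallel reads via the Storage Read API
    VIEW_MATERIALIZATION  // issues a SQL query against BigQuery
}

// Native tables and BigLake tables can be read directly through the
// Storage Read API; other external tables fall back to the view path.
static ReadPath chooseReadPath(boolean isNativeTable, boolean isBigLakeTable)
{
    if (isNativeTable || isBigLakeTable) {
        return ReadPath.STORAGE_API;
    }
    return ReadPath.VIEW_MATERIALIZATION;
}
```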

Additional context and related issues

Fixes https://github.com/trinodb/trino/issues/21016

See also: https://cloud.google.com/bigquery/docs/biglake-intro

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# BigQuery
* Improve performance when reading external BigLake tables. ({issue}`21016`)

marcinsbd avatar Aug 07 '24 16:08 marcinsbd

I will squash the commits once I've made sure that all changes have been properly cherry-picked, and then rebase on master.

marcinsbd avatar Aug 08 '24 11:08 marcinsbd

@marcinsbd Can we update the PR description?

Praveen2112 avatar Sep 10 '24 07:09 Praveen2112

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Nov 07 '24 17:11 github-actions[bot]

/test-with-secrets sha=5c2fb2886f53965852b06d5f465b7deef089de5e

electrum avatar Nov 19 '24 22:11 electrum

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/11923270255

github-actions[bot] avatar Nov 19 '24 22:11 github-actions[bot]

I think we should handle the following limitation:

https://cloud.google.com/bigquery/docs/biglake-intro

The BigQuery Storage API is not available in other cloud environments, such as AWS and Azure.

ebyhr avatar Nov 22 '24 07:11 ebyhr

@ebyhr But for Azure or AKS we could use BigLake Omni, right? Or should we use a flag to control them?

Praveen2112 avatar Nov 26 '24 09:11 Praveen2112

@Praveen2112 I'm not sure how BigLake Omni works in this case. How about adding another condition to BigQueryMetadata#isBigLakeTable? e.g.

externalTableDefinition.getSourceUris().stream().allMatch(uri -> uri.startsWith("gs://"))

ebyhr avatar Nov 26 '24 10:11 ebyhr
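A minimal sketch of what the combined check could look like, assuming the google-cloud-bigquery Java client (getConnectionId() and getSourceUris() are existing client methods); the method shape is illustrative, not the actual BigQueryMetadata code:

```java
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.TableDefinition;
import com.google.cloud.bigquery.TableInfo;

import java.util.List;

// Illustrative sketch; the real BigQueryMetadata#isBigLakeTable may differ.
static boolean isBigLakeTable(TableInfo table)
{
    TableDefinition definition = table.getDefinition();
    if (!(definition instanceof ExternalTableDefinition external)) {
        return false; // not an external table at all
    }
    // BigLake tables are external tables that carry a connection ID
    if (external.getConnectionId() == null) {
        return false;
    }
    // Exclude BigLake Omni tables (data on AWS/Azure): the Storage
    // Read API only covers data stored on Google Cloud Storage
    List<String> sourceUris = external.getSourceUris();
    return sourceUris != null && sourceUris.stream().allMatch(uri -> uri.startsWith("gs://"));
}
```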

Hi @anoopj, what would be the correct way to check whether a TableDefinition describes a BigLake table (external table with a connection ID)? Is there a simple way, using the API, to distinguish a BigLake table from an Object table and an Omni table?

marcinsbd avatar Nov 29 '24 23:11 marcinsbd

what would be the correct way to check whether a TableDefinition describes a BigLake table (external table with a connection ID)?

Not sure if I understood the question, but a BigLake table will have the connection ID in the table's ExternalDataConfiguration.

is there a simple way using the API to distinguish a BigLake table from an Object table and an Omni table?

You can tell from the dataset region (preferred) or by looking at the sourceUris, as mentioned in the comment above.

anoopj avatar Dec 02 '24 22:12 anoopj
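To illustrate the preferred dataset-region check: BigQuery Omni datasets live in regions whose names carry a cloud prefix, such as aws-us-east-1 or azure-eastus2. A minimal sketch with the google-cloud-bigquery client follows; the helper name isOmniDataset is made up for illustration:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetId;

// Hypothetical helper: detects BigQuery Omni datasets by their region name.
static boolean isOmniDataset(BigQuery bigquery, String projectId, String datasetName)
{
    Dataset dataset = bigquery.getDataset(DatasetId.of(projectId, datasetName));
    // Omni regions are named like "aws-us-east-1" or "azure-eastus2",
    // unlike regular BigQuery regions such as "US" or "us-central1"
    String location = dataset.getLocation();
    return location != null
            && (location.startsWith("aws-") || location.startsWith("azure-"));
}
```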

Thanks @ebyhr, @Praveen2112, @anoopj, @krvikash, @pajaks for the review and your help. AC

marcinsbd avatar Dec 04 '24 11:12 marcinsbd

Let's do another round of review, @ebyhr, @Praveen2112, @krvikash.

marcinsbd avatar Dec 10 '24 10:12 marcinsbd

@marcinsbd Can we rebase the PR?

Praveen2112 avatar Dec 10 '24 11:12 Praveen2112