dbt-bigquery
[Feature] Support Python stored procedures as a way to submit Python models
Is this your first time submitting a feature request?
- [X] I have read the expectations for open source contributors
- [X] I have searched the existing issues, and I could not find an existing issue for this feature
- [X] I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion
Describe the feature
BigQuery supports running Spark code as a stored procedure (link). We can add a new submission method in dbt-bigquery to run Python models by leveraging it.
dbt would still do the Jinja templating it does today, but when submitting the job it would create or replace a stored procedure and then call it.
There are also some ways to get output back from the stored procedure that we might be able to integrate with.
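To illustrate the idea, here is a rough sketch of the SQL such a submission method could generate and execute. This is not code from any actual branch; the project, dataset, connection name, runtime version, and PySpark body are placeholders, based on BigQuery's documented CREATE PROCEDURE ... LANGUAGE PYTHON support for Spark:

-- create or replace a Spark stored procedure whose body is the compiled Python model
CREATE OR REPLACE PROCEDURE `my_project.my_dataset.my_python_model__dbt_sp`()
WITH CONNECTION `my_project.us-central1.my-spark-connection`
OPTIONS (engine = 'SPARK', runtime_version = '1.1')
LANGUAGE PYTHON AS R"""
# compiled dbt Python model code would go here: read refs, build the dataframe, write the target table
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("bigquery").option("table", "my_project.my_dataset.my_upstream_model").load()
df.write.format("bigquery") \
    .option("table", "my_project.my_dataset.my_python_model") \
    .option("writeMethod", "direct") \
    .mode("overwrite") \
    .save()
""";

-- execute it; the table written by the procedure is the model's output
CALL `my_project.my_dataset.my_python_model__dbt_sp`();

The CALL is what actually runs the Spark job; dbt's existing Python model helper code would presumably be folded into the procedure body before submission.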
Describe alternatives you've considered
No response
Who will this benefit?
No response
Are you interested in contributing this feature?
No response
Anything else?
No response
I am trying to follow the tutorial here to create a Python stored procedure, and found that I got stuck trying to create a connection. On the page https://cloud.google.com/bigquery/docs/working-with-connections#bq_1, I am not sure how to create the connection I need to use in the stored procedure.
UPDATE:
- found https://cloud.google.com/bigquery/docs/connect-to-spark and created a connection; let's see if it works
- successfully created the stored procedure; running it now, it has been 4 minutes and it is still executing
- the job failed; realized the connection does not have permission, found the service account associated with the connection, granted permission, and am rerunning
- still trying to figure out the right permissions to get it working
- it does not seem like I can find the generated service account to give it access to things
Based on this experience, this feels even more opaque than Dataproc Serverless, which took less time to set up.
Talked to the BQ team yesterday (thanks!) and got a walk-through of how to give access to the connection account created by BQ. Will pick this up soon!
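For reference, the account that needs access is the one returned as serviceAccountId by bq show --connection for the Spark connection. The kind of dataset grant involved looks roughly like the statement below; the names are illustrative placeholders, and the exact roles depend on what the procedure reads and writes:

GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my_project.my_dataset`
TO "serviceAccount:<connection-service-account>";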
Hi @ChenyuLInx, I drafted some code here: https://github.com/dbt-labs/dbt-bigquery/compare/main...velascoluis:dbt-bigquery:spark-stored-procedure Can you have a look perhaps?
End-to-end setup:
Spark stored procedures as a dbt Python submission method: example run
Setup
Execute the following from a terminal (Cloud Shell, for instance):
Set up environment variables, for example:
export GOOGLE_CLOUD_PROJECT=spark-dbt-test #Change with your own project
export REGION=us-central1 #Change with your own region
Create a bucket to stage code and serve as a Spark staging bucket:
gcloud storage buckets create gs://${GOOGLE_CLOUD_PROJECT}-bucket --location=${REGION}
Create a dataset on BigQuery:
bq mk --location=${REGION} `echo "${GOOGLE_CLOUD_PROJECT}" | tr '-' '_'`_dataset
Create a new service account; it will be used as the main dbt identity:
gcloud iam service-accounts create ${GOOGLE_CLOUD_PROJECT}-sa
Grant permissions on the service account:
#Execute BQ jobs
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/bigquery.user
#Use bigquery.connections.delegate
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/bigquery.connectionAdmin
#Access the staging bucket
gcloud storage buckets add-iam-policy-binding gs://${GOOGLE_CLOUD_PROJECT}-bucket --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin
#Create tables on the dataset
bq query --use_legacy_sql=false << EOF
GRANT \`roles/bigquery.dataEditor\`
ON SCHEMA \`${GOOGLE_CLOUD_PROJECT}.$(echo "${GOOGLE_CLOUD_PROJECT}" | tr '-' '_')_dataset\`
TO "serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com";
EOF
#Avoid "Requesting user cannot act as service account" / "cross-project service account usage is disabled" errors
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/iam.serviceAccountUser
Create a Spark external connection:
bq mk --connection --location=${REGION} --project_id=${GOOGLE_CLOUD_PROJECT} --connection_type=SPARK ${GOOGLE_CLOUD_PROJECT}-conn
Get the connection's service account and grant it permissions:
CONNECTION_SA=$(bq show --format=prettyjson --connection ${GOOGLE_CLOUD_PROJECT}.${REGION}.${GOOGLE_CLOUD_PROJECT}-conn | jq -r '.spark.serviceAccountId')
#Access the PySpark code in the bucket
gcloud storage buckets add-iam-policy-binding gs://${GOOGLE_CLOUD_PROJECT}-bucket --member=serviceAccount:${CONNECTION_SA} --role=roles/storage.admin
Create a JSON key for the service account:
gcloud iam service-accounts keys create ${HOME}/${GOOGLE_CLOUD_PROJECT}-sa.json --iam-account=${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com
dbt Core installation
Install dbt from source:
git clone https://github.com/velascoluis/dbt-bigquery
cd dbt-bigquery
git checkout spark-stored-procedure
pip3 install .
Init a new dbt project:
cd
dbt init
Enter values in the interactive shell:
- Enter dbt project name. e.g. dbt_sp_velascoluis
- Select bigquery as database [1]
- Select service account as authentication method [2]
- Enter path of the JSON keyfile
- Enter GCP project name. e.g. dbt-sp-velascoluis
- Enter BQ dataset e.g. dbt_sp_velascoluis_dataset
- Leave threads at 1
- Leave timeout at 300
- Select location e.g. US
Edit the dbt profiles.yml under $HOME/.dbt and add gcs_bucket and spark_external_connection_name; it should look like this:
spark_dbt_test:
  outputs:
    dev:
      dataset: spark_dbt_test_dataset
      job_execution_timeout_seconds: 300
      job_retries: 1
      keyfile: /home/admin_/spark-dbt-test-sa.json
      location: us-central1
      method: service-account
      priority: interactive
      project: spark-dbt-test
      threads: 1
      type: bigquery
      gcs_bucket: spark-dbt-test-bucket
      spark_external_connection_name: spark-dbt-test-conn
  target: dev
NOTE: Additionally, you can set submission_method: stored_procedure explicitly, although it is now the default.
Run dbt debug to make sure everything is working OK:
cd dbt_spark_stored_procedure
dbt debug
Run a sample pipeline with Python code
Add a new Python file under project_folder/models/example, for example my_third_python_model.py:
import pandas as pd

def model(dbt, session):
    # dbt.ref returns a Spark DataFrame for the upstream model
    second_model = dbt.ref("my_second_dbt_model")
    # convert it to a pandas DataFrame and return it as the model's output
    df = second_model.toPandas()
    return df
Run the pipeline from the project root:
cd dbt_spark_stored_procedure
dbt run
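If the run succeeds, the model should end up materialized as a table named after the model in the target dataset. Assuming the names used above (and dbt's default aliasing), a quick way to verify from the BigQuery console is:

SELECT * FROM `spark-dbt-test.spark_dbt_test_dataset.my_third_python_model` LIMIT 10;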
@velascoluis the change set looks great! I will test it locally today. Do you want to add some tests to get a PR going, or do you want us to take it over?
@velascoluis Are there any performance improvements from using Python stored procedures versus using Dataproc? I'm trying to understand the differences in the experience and why a user would choose one over the other.