Issue with loading terminology seeds from S3 on Databricks

Open sarahmcmorgan opened this issue 2 years ago • 1 comments

Describe the bug

Bug reported by someone on Slack. They are not able to use the post-hook macro to load seeds from S3 into their Databricks warehouse.

Struggling a bit to load terminologies to DBX w/ Unity (deets here). If I were to just load the terminologies manually, what should I comment out in the DBT project to make sure the terminology tables are not overwritten?

This turned out to be simple. For posterity: If using a "shared access" cluster or a SQL Serverless warehouse (same idea), the s3 copies fail b/c it's not possible to set the environment variables w/ Tuva bucket keys. Everything works fine on a "single user cluster", where the user can set environment variables.

I think any config changes would likely have to occur on the DBX side, i.e. registering the s3 keys up front in the cluster configuration or registering an "external volume". For me, the confusion originated from the dbt-databricks docs, which recommend running on the SQL Warehouse product b/c it's easy to monitor and debug the queries DBT generates. But I'm doubtful you can connect directly to s3 on the warehouse product without mediating via "Unity Catalog" (by design).

https://www.databricks.com/blog/admin-isolation-shared-clusters

sarahmcmorgan avatar Apr 25 '24 22:04 sarahmcmorgan

Still no luck here, but it's fine. We can load the seeds on a single user cluster, then schedule all the SQL against the warehouse (you can run tasks on separate clusters in a Databricks job).

But it did occur to me...

The common denominator w/ everyone running this project is python. You need it to set up dbt, and therefore to run Tuva.

Perhaps there is an approach to getting the s3 seeds up via boto3, then streaming the insert statements into w/e warehouse DBT is connected to? I'm eyeing, but haven't explored, this: https://docs.getdbt.com/docs/build/python-models
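
To make the boto3 idea a bit more concrete, here is a rough, untested-against-Tuva sketch of what "download the seed, then stream inserts" could look like. Everything named here is an assumption for illustration: the helper names (`fetch_seed_text`, `insert_statements`, `sql_literal`), the idea that the seed is a plain headerless UTF-8 CSV, and the choice to treat empty fields as NULL. The bucket, key, table, and column names would have to come from the project; none are filled in here.

```python
import csv
import io
from typing import Iterator, List, Sequence


def sql_literal(value: str) -> str:
    """Render one CSV field as a SQL string literal."""
    # Assumption: an empty field in the seed file means NULL.
    if value == "":
        return "NULL"
    # ANSI-style quote doubling. Some SQL dialects expect a backslash
    # escape instead, so check this against the target warehouse.
    return "'" + value.replace("'", "''") + "'"


def insert_statements(
    csv_text: str,
    table: str,
    columns: Sequence[str],
    batch_size: int = 500,
) -> Iterator[str]:
    """Yield multi-row INSERT statements for headerless CSV text."""
    prefix = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    rows: List[str] = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:  # skip blank lines
            continue
        rows.append("(" + ", ".join(sql_literal(v) for v in record) + ")")
        if len(rows) == batch_size:
            yield prefix + ", ".join(rows)
            rows = []
    if rows:  # flush the final partial batch
        yield prefix + ", ".join(rows)


def fetch_seed_text(bucket: str, key: str) -> str:
    """Download one object from S3 and decode it as UTF-8 text."""
    # Imported lazily so the helpers above can be used without boto3.
    import boto3

    # boto3 resolves credentials from its default chain (env vars,
    # shared config, etc.) on the machine where this script runs.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return body.decode("utf-8")
```

Each yielded statement could then be passed to `cursor.execute(...)` on whatever DB-API connection the warehouse's Python driver provides. Building SQL by string concatenation is only reasonable for trusted seed files; a driver's parameterized inserts would be the safer choice if it supports them.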

cc: @sarah-tuva

dr00b avatar May 11 '24 13:05 dr00b