
Allow GCS staging buckets for Databricks destination


Feature description

The Databricks destination stages data in an external bucket before copying the data into Delta Lake. I'm working on GCP and need to use a GCS bucket for staging the data. I was able to get this working with a forked version of the repo.
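For context, a minimal sketch of the target setup, assuming dlt's filesystem staging destination and placeholder names (the actual bucket URL and credentials would live in config/secrets, not inline):

```python
import dlt

# Minimal sketch, not taken from the issue: a Databricks destination that
# stages load packages in a bucket via dlt's filesystem staging destination.
# With this feature, the bucket_url configured for the filesystem staging
# (e.g. in secrets.toml) could be a gs:// URL instead of an az://abfss:// one.
pipeline = dlt.pipeline(
    pipeline_name="gcs_staged_databricks",  # placeholder name
    destination="databricks",
    staging="filesystem",  # staging bucket configured separately, e.g. bucket_url = "gs://my-staging-bucket"
)
```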

Are you a dlt user?

Yes, I run dlt in production.

Use case

My company is migrating to Databricks on GCP. We have a dozen dlt pipelines in production that will need to be pointed to Databricks.

Proposed solution

Modify dlt/destinations/impl/databricks/databricks.py so that a gs bucket_scheme makes its way to https://github.com/dlt-hub/dlt/blob/854905fb56576bc608b01b6b047208df888160a7/dlt/destinations/impl/databricks/databricks.py#L83. If there is no storage credential, still throw an exception saying that GCS buckets do not work with temporary credentials. A rough sketch of the proposed check is shown below.
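The sketch below is illustrative only: the helper name validate_staging_bucket and the exception types are hypothetical and do not mirror the linked databricks.py verbatim; it only demonstrates the proposed behavior (accept gs with a named storage credential, raise otherwise).

```python
from typing import Optional
from urllib.parse import urlparse

# Constant as quoted from databricks.py, plus the schemes proposed/already supported.
AZURE_BLOB_STORAGE_PROTOCOLS = ["az", "abfss", "abfs"]
SUPPORTED_PROTOCOLS = AZURE_BLOB_STORAGE_PROTOCOLS + ["gs", "s3"]


def validate_staging_bucket(bucket_url: str, staging_credentials_name: Optional[str]) -> str:
    """Illustrative helper: return the bucket scheme if the staging location is usable, else raise."""
    bucket_scheme = urlparse(bucket_url).scheme
    if bucket_scheme not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"Unsupported staging bucket scheme: {bucket_scheme}")
    if bucket_scheme == "gs" and not staging_credentials_name:
        # Per the proposal: GCS staging does not work with temporary credentials,
        # so a named storage credential is required for gs:// URLs.
        raise ValueError(
            "GCS staging buckets do not work with temporary credentials; "
            "configure a named storage credential (staging_credentials_name)."
        )
    return bucket_scheme


# Example: accepted with a named credential, rejected without one.
validate_staging_bucket("gs://my-staging-bucket/load", "my_storage_credential")
```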

Related issues

No response

16bzwiener avatar Oct 01 '24 02:10 16bzwiener

@16bzwiener OK, that seems to be pretty easy. We can release a fix early next week. If you have nothing against hacking the existing installation of dlt to see if named credentials work for you, add gs to

AZURE_BLOB_STORAGE_PROTOCOLS = ["az", "abfss", "abfs"]

in databricks.py directly in the installed package.
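For reference, after that hack the constant in the installed package would read:

```python
# dlt/destinations/impl/databricks/databricks.py (local edit of the installed package)
AZURE_BLOB_STORAGE_PROTOCOLS = ["az", "abfss", "abfs", "gs"]
```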

rudolfix avatar Oct 01 '24 18:10 rudolfix

I was able to get it to work in my forked version here. It only worked if DESTINATION__DATABRICKS__STAGING_CREDENTIALS_NAME (or the equivalent toml secret) was defined.
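For anyone reproducing this, a minimal illustration of supplying that setting via the environment; the credential name is a placeholder, and the variable should correspond to staging_credentials_name under [destination.databricks] in secrets.toml:

```python
import os

# Placeholder credential name: it must reference an existing Databricks storage
# credential with access to the GCS staging bucket.
os.environ["DESTINATION__DATABRICKS__STAGING_CREDENTIALS_NAME"] = "my_storage_credential"
```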

16bzwiener avatar Oct 01 '24 22:10 16bzwiener