Allow GCS staging buckets for Databricks destination
Feature description
The Databricks destination stages data in an external bucket before copying the data into Delta Lake. I'm working on GCP and need to use a GCS bucket for staging the data. I was able to get this working with a forked version of the repo.
Are you a dlt user?
Yes, I run dlt in production.
Use case
My company is migrating to Databricks on GCP. We have a dozen dlt pipelines in production that will need to be pointed to Databricks.
Proposed solution
Modify dlt/destinations/impl/databricks/databricks.py so that a gs bucket_scheme can make its way to https://github.com/dlt-hub/dlt/blob/854905fb56576bc608b01b6b047208df888160a7/dlt/destinations/impl/databricks/databricks.py#L83
If no named storage credential is configured, still raise an exception saying that GCS buckets do not work with temporary credentials.
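For illustration, a rough sketch of the intended behaviour (the function and constant names below are placeholders and do not claim to match the real code at the linked line): treat gs like the Azure schemes when a named storage credential is configured, and fail with a clear message otherwise.

```py
from typing import Optional

# Hypothetical sketch only -- names are illustrative; only the decision logic matters.
AZURE_BLOB_STORAGE_PROTOCOLS = ["az", "abfss", "abfs"]
GCS_PROTOCOLS = ["gs"]  # assumption: only the gs scheme needs to be recognized


def credentials_clause_for_bucket(bucket_scheme: str, staging_credentials_name: Optional[str]) -> str:
    """Return the COPY INTO credential clause for an external staging bucket."""
    if bucket_scheme in AZURE_BLOB_STORAGE_PROTOCOLS + GCS_PROTOCOLS:
        if staging_credentials_name:
            # A Databricks named storage credential covers both ADLS and GCS buckets.
            return f"WITH(CREDENTIAL {staging_credentials_name})"
        if bucket_scheme in GCS_PROTOCOLS:
            # There is no temporary-credential fallback for GCS, so fail loudly.
            raise ValueError(
                "GCS staging buckets require staging_credentials_name; "
                "temporary credentials do not work with gs:// buckets."
            )
    # s3 and Azure-without-a-named-credential keep their existing handling.
    return ""
```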
Related issues
No response
@16bzwiener OK, that seems to be pretty easy. We can release a fix early next week. If you have nothing against hacking the existing installation of dlt to see if a named credential works for you, add gs to

```py
AZURE_BLOB_STORAGE_PROTOCOLS = ["az", "abfss", "abfs"]
```

in databricks.py directly in the installed package.
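With that local hack applied, the line in the installed databricks.py would simply read:

```py
AZURE_BLOB_STORAGE_PROTOCOLS = ["az", "abfss", "abfs", "gs"]
```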
I was able to get it to work in my forked version here. It only worked if DESTINATION__DATABRICKS__STAGING_CREDENTIALS_NAME (or the equivalent toml secret) was defined.
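For anyone else trying this, a minimal end-to-end wiring might look like the sketch below. The bucket name, credential name, and dataset are placeholders; the Databricks connection secrets and the GCP credentials used to write to the staging bucket are assumed to be configured elsewhere (e.g. in secrets.toml or via application default credentials), and the env vars could equally live in config.toml/secrets.toml.

```py
import os

import dlt

# Illustrative values only -- replace the bucket and credential names with your own.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "gs://my-dlt-staging-bucket"
os.environ["DESTINATION__DATABRICKS__STAGING_CREDENTIALS_NAME"] = "my_gcs_storage_credential"

pipeline = dlt.pipeline(
    pipeline_name="gcs_staging_demo",
    destination="databricks",
    staging="filesystem",  # stage load files in the GCS bucket before copying into Delta Lake
    dataset_name="demo_dataset",
)

# Load a tiny sample table through the GCS staging bucket into Databricks.
info = pipeline.run([{"id": 1, "name": "alice"}], table_name="users")
print(info)
```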