astro-sdk
Native GCS to Databricks DeltaTable autoloader dependent on pre-set credentials in the cluster
Describe the bug

Currently, the native transfer between GCS and a Databricks Delta table relies on credentials being pre-configured on the Databricks cluster. The credentials that the Astro Python SDK 1.5.2 sets from the Airflow connection (and that our tests set) are not sufficient on their own.
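For context, this is a minimal sketch of the kind of transfer that exercises the native path; the connection IDs, bucket path, and table name are hypothetical:

```python
# Minimal sketch (hypothetical conn IDs and paths) of a GCS -> Databricks Delta
# transfer with the Astro Python SDK. Runs inside a DAG definition.
from astro import sql as aql
from astro.files import File
from astro.table import Table

load_delta = aql.load_file(
    input_file=File("gs://example-bucket/example.csv", conn_id="gcp_conn"),
    output_table=Table(name="example_delta_table", conn_id="databricks_conn"),
)
```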
Version
- Astro Python SDK: 1.5.2
To Reproduce
Remove the following Spark settings from the Databricks cluster:
- spark.hadoop.fs.gs.auth.service.account.email
- spark.hadoop.fs.gs.project.id
- spark.hadoop.google.cloud.auth.service.account.enable
- spark.hadoop.fs.gs.auth.service.account.private.key
- spark.hadoop.fs.gs.auth.service.account.private.key.id
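(For reference, when present these are typically defined in the cluster's Spark config, one `key value` pair per line; the values below are placeholders:)

```
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.project.id <gcp-project-id>
spark.hadoop.fs.gs.auth.service.account.email <service-account-email>
spark.hadoop.fs.gs.auth.service.account.private.key <private-key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private-key-id>
```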
Try to run the test:
```
pytest tests_integration/databases/databricks_tests/test_load.py::test_delta_load_file_azure_wasb[delta-azure_blob_storage]
```
See it failing:
```
IllegalArgumentException: clientEmail must be set if using credentials configured directly in configuration.
---------------------------------------------------------------------------
(...)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    200             # Hide where the exception came from that shows a non-Pythonic
    201             # JVM exception message.
--> 202             raise converted from None
    203         else:
    204             raise
IllegalArgumentException: clientEmail must be set if using credentials configured directly in the configuration.
```
Expected behaviour

Without any pre-set configuration on the Databricks cluster, we should be able to transfer natively from GCS to Databricks using the information contained in the Airflow connection. If that is not possible, we should find a way to set up the test with all the necessary credentials so that it does not rely on credentials pre-configured in the cluster.
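One possible direction (a sketch only, not the SDK's current behaviour) is to build the GCS connector settings from the Airflow GCP connection at transfer time, so nothing needs to be pre-set on the cluster. The helper name and the `keyfile_dict` extras layout below are assumptions:

```python
# Hypothetical sketch: derive the Hadoop GCS connector settings from an Airflow
# GCP connection instead of relying on cluster-level Spark config. The function
# name and extras layout are illustrative, not part of the SDK.
import json

from airflow.hooks.base import BaseHook


def gcs_spark_conf_from_airflow_conn(conn_id: str) -> dict[str, str]:
    """Build the GCS connector Spark settings from a GCP Airflow connection."""
    conn = BaseHook.get_connection(conn_id)
    # Assumes the service-account keyfile JSON is stored in the connection
    # extras under "keyfile_dict" (one common layout for GCP connections).
    keyfile = json.loads(conn.extra_dejson["keyfile_dict"])
    return {
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.fs.gs.project.id": keyfile["project_id"],
        "spark.hadoop.fs.gs.auth.service.account.email": keyfile["client_email"],
        "spark.hadoop.fs.gs.auth.service.account.private.key": keyfile["private_key"],
        "spark.hadoop.fs.gs.auth.service.account.private.key.id": keyfile["private_key_id"],
    }
```

The open design question would then be how to apply these settings to the running cluster or job at transfer time, rather than requiring them in the cluster's static Spark config.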