
dbx does not use credential passthrough


Expected Behavior

I am working with Azure Databricks. I have a cluster with credential passthrough, which allows me to read data stored in ADLS Gen2 using my own identity. I can simply log into the Databricks workspace, attach a notebook to the cluster, and query the Delta tables in ADLS Gen2 without any additional setup.

I would expect that when I submit dbx execute --cluster-id cluster123 --job jobABC to the same cluster, it would also be able to read those datasets from ADLS Gen2 using my identity.
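For illustration, this is roughly the kind of read that works from a notebook attached to the passthrough cluster (the container, storage account, and path below are placeholders, not my actual names):

    # Works as-is in a notebook on the credential-passthrough cluster;
    # the ADLS Gen2 token comes from my own Azure AD identity, no extra config.
    df = spark.read.format("delta").load(
        "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/table"
    )
    display(df.limit(10))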

Thanks!

Current Behavior

Currently, when I use dbx execute to run a job on that cluster, the job fails with the following error:

Py4JJavaError: An error occurred while calling o469.load.
: com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen2 Token
        at com.databricks.backend.daemon.data.client.adl.AdlGen2UpgradeCredentialContextTokenProvider.$anonfun$getToken$1(AdlGen2UpgradeCredentialContextTokenProvider.scala:37)
        at scala.Option.getOrElse(Option.scala:189)
        at com.databricks.backend.daemon.data.client.adl.AdlGen2UpgradeCredentialContextTokenProvider.getToken(AdlGen2UpgradeCredentialContextTokenProvider.scala:31)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAccessToken(AbfsClient.java:1371)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:306)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:238)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:211)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:209)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:1213)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:1194)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:437)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1107)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:901)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:891)

From my understanding, it is expecting a service principal or storage account keys to be configured.
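For reference, this is the kind of explicit service principal (OAuth) configuration the error seems to expect instead of passthrough. It is only an illustrative sketch; every name below is a placeholder rather than something from my setup:

    # Illustrative sketch of explicit service-principal auth for ABFS;
    # all values are placeholders. With passthrough I should not need any of this.
    account = "<storage-account>.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{account}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<app-client-id>")
    spark.conf.set(
        f"fs.azure.account.oauth2.client.secret.{account}",
        dbutils.secrets.get(scope="<secret-scope>", key="<client-secret-key>"),
    )
    spark.conf.set(
        f"fs.azure.account.oauth2.client.endpoint.{account}",
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    )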

Steps to Reproduce (for bugs)

  1. Clone the charming-aurora repo - https://github.com/gstaubli/dbx-charming-aurora
  2. Run dbx configure --token to link dbx to the Databricks workspace
  3. Add a new job to the conf/deployment.yml file:
      - name: "my-test-job"
        spark_python_task:
          python_file: "file://charming_aurora/tasks/sample_etl_task.py"
          parameters: [ "--conf-file", "file:fuse://conf/tasks/sample_etl_config.yml" ]
  4. Update the sample ETL task to read an ADLS Delta table (a consolidated sketch follows this list) - https://github.com/gstaubli/dbx-charming-aurora/blob/main/charming_aurora/tasks/sample_etl_task.py
    def _write_data(self):
        # f is pyspark.sql.functions, imported at the top of sample_etl_task.py;
        # the abfss path is a placeholder for the real container/account/table
        df = (
            self.spark.read.format("delta")
            .load(
                "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/table"
            )
            .filter(f.col("date") == "2024-01-01")
        )
        print(df.count())
  5. Submit the job - dbx execute --cluster-id=cluster-id-with-credential-passthrough --job my-test-job
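For completeness, here is a consolidated sketch of the modified task file from step 4, with the import it relies on made explicit. The class and base-class names are assumptions taken from the charming-aurora template and may differ slightly from the actual repo; the table path is a placeholder:

    # Consolidated sketch of the modified sample_etl_task.py (step 4).
    # Class/base-class names are assumed from the charming-aurora template;
    # the ADLS Gen2 path is a placeholder.
    from pyspark.sql import functions as f

    from charming_aurora.common import Task  # base class provided by the template


    class SampleETLTask(Task):
        def _write_data(self):
            df = (
                self.spark.read.format("delta")
                .load("abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/table")
                .filter(f.col("date") == "2024-01-01")
            )
            print(df.count())

        def launch(self):
            # entrypoint invoked when the task runs
            self._write_data()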

Context

I specifically want to "dbx execute" against my interactive cluster and not create a job cluster.

Your Environment

  • dbx version used: 0.8.18
  • Databricks Runtime version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)

mathurk1 - Apr 26 '24 22:04