
[BUG] Using SAS token for auth attempts to use the account key

Open psee-code opened this issue 3 years ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Attempting to use the Excel reader with SAS token authentication configured throws an error about the account key setting.

Failure to initialize configuration
Invalid configuration value detected for fs.azure.account.key
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:556)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1695)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:217)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:132)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at com.crealytics.spark.excel.WorkbookReader$.readFromHadoop$1(WorkbookReader.scala:60)
	at com.crealytics.spark.excel.WorkbookReader$.$anonfun$apply$4(WorkbookReader.scala:79)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:102)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:102)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:33)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:32)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:87)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:444)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:400)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:400)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

Expected Behavior

When Spark is configured to use a SAS token, the reader should use the SAS-related settings rather than the account key.

Steps To Reproduce

from pyspark.sql import SparkSession

sas_token = "SAS_TOKEN"

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.conf.set("fs.azure.account.auth.type.STORAGE_ACCOUNT.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.STORAGE_ACCOUNT.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.STORAGE_ACCOUNT.dfs.core.windows.net", sas_token)

test_url = "abfss://CONTAINER@STORAGE_ACCOUNT.dfs.core.windows.net/FILE.xlsx"
test_df = spark.read.format("com.crealytics.spark.excel") \
  .option("header", "false") \
  .option("inferSchema", "true") \
  .load(test_url)
test_df.show()

Environment

- Spark version: 3.1.2
- Spark-Excel version: 0.18.3 (com.crealytics:spark-excel_2.12:3.1.2_0.18.3)
- Databricks on Azure

Anything else?

Using the code in the steps to reproduce to pull a CSV file works fine; the issue only appears when attempting to pull an Excel file with the SAS token.

psee-code avatar Oct 20 '22 23:10 psee-code

Hi @pseemangal, can you try with .format("excel")? That uses the newer v2 reader which is more similar to the CSV one.

nightscape avatar Oct 21 '22 07:10 nightscape

Hey @nightscape,

It gives the same error: Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key

psee-code avatar Oct 21 '22 12:10 psee-code

Hi @pseemangal, I solved the problem by setting all configs through the SparkContext, like this:

sc._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.STORAGE_ACCOUNT.dfs.core.windows.net", "SAS") etc.

For some reason this is the only way spark-excel picks them up correctly. @nightscape I don't know if this helps with solving the bug.
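A minimal sketch of that workaround, assuming a Databricks notebook where `spark` is already defined. `STORAGE_ACCOUNT` and `"SAS_TOKEN"` are placeholders as in the original report, and the `sas_confs`/`hadoop_conf` names are illustrative, not part of any API:

```python
# The same three SAS settings as in the reproduction, keyed by their
# full Hadoop config names for a given storage account.
sas_token = "SAS_TOKEN"       # placeholder: your SAS token
account = "STORAGE_ACCOUNT"   # placeholder: your storage account name
suffix = f"{account}.dfs.core.windows.net"

sas_confs = {
    f"fs.azure.account.auth.type.{suffix}": "SAS",
    f"fs.azure.sas.token.provider.type.{suffix}":
        "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
    f"fs.azure.sas.fixed.token.{suffix}": sas_token,
}

# Apply them on the Hadoop configuration (SparkContext level) instead of
# spark.conf.set(...). Requires a live Spark session, so commented out here:
# hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# for key, value in sas_confs.items():
#     hadoop_conf.set(key, value)
```

The difference from the reproduction is only *where* the settings land: on the JVM-side Hadoop configuration rather than on the SparkSession's runtime conf.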

francescosaracco avatar Dec 13 '22 13:12 francescosaracco

@francescosaracco you bring up a good point - the configs only seem to get picked up from the SparkContext, regardless of what is set on the SparkSession. For other file formats (CSV, Parquet, etc.) using native Spark readers, this is not an issue.

williamdphillips avatar Mar 29 '23 20:03 williamdphillips