spark-excel icon indicating copy to clipboard operation
spark-excel copied to clipboard

[BUG] Cannot read files into dataframe in Databricks 9.1 LTS Runtime 3.1.2 Spark

Open saipraneethreddy opened this issue 2 years ago • 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Below is my set up spark.conf.set("fs.azure.account.auth.type." + storageAccountName + ".dfs.core.windows.net", "OAuth") spark.conf.set("fs.azure.account.oauth.provider.type." + storageAccountName + ".dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set("fs.azure.account.oauth2.client.id." + storageAccountName + ".dfs.core.windows.net", "" + appID + "") spark.conf.set("fs.azure.account.oauth2.client.secret." + storageAccountName + ".dfs.core.windows.net", "" + password + "") spark.conf.set("fs.azure.account.oauth2.client.endpoint." + storageAccountName + ".dfs.core.windows.net", "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")

com.crealytics:spark-excel-2.12.17-3.1.2_2.12:3.1.2_0.18.1

tried to check %fs ls - files are getting listed as expected I can read csv files from the path on gen2 storage

crealytics_issue

Expected Behavior

I would expect to read excel file into a dataframe and not get below error get InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key

Steps To Reproduce

No response

Environment

- Spark version:3.1.2
- Spark-Excel version: com.crealytics:spark-excel-2.12.17-3.1.2_2.12:3.1.2_0.18.1
- OS: Windows 10
- Cluster environment : 9.1 LTS

Anything else?

No response

saipraneethreddy avatar Feb 06 '23 22:02 saipraneethreddy

Please check these potential duplicates:

  • [#682] [BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark (75.91%) If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.

github-actions[bot] avatar Feb 06 '23 22:02 github-actions[bot]

for what its worth (as i was troubleshooting similar yesterday and struggling to find anything other than actually using an account key or SAS key) This article provides a workaround https://stackoverflow.com/questions/73864385/invalid-configuration-value-detected-for-fs-azure-account-key-with-com-crealytic/73865969#73865969 which worked for us with com.crealytics:spark-excel-2.12.17-3.2.2_2.12:3.2.2_0.18.1 and using "excel"

replacing spark.conf.set with spark._jsc.hadoopConfiguration().set

eg

spark._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

gtowsey avatar Mar 16 '23 23:03 gtowsey

Spark Version: 3.2.1 Databricks Runtime: 10.4LTS

In our scenario, setting a property on the spark context instead of the spark session is not an option. The same configs that work when setting on spark session for reading other files do not work with excel. Our setup may be a bit different because we are using a refresh token for authentication instead of client credentials, but nonetheless using spark session properties should still work.

spark.conf.set(f'fs.azure.account.auth.type', 'OAuth') spark.conf.set(f'fs.azure.account.oauth.provider.type', 'org.apache.hadoop.fs.azurebfs.oauth2.RefreshTokenBasedTokenProvider') spark.conf.set(f'fs.azure.account.oauth2.refresh.token', refresh_token) spark.conf.set(f'fs.azure.account.oauth2.refresh.endpoint', f'https://login.microsoftonline.com/{tenant_id}/oauth2/token') spark.conf.set(f'fs.azure.account.oauth2.client.id', client_id)

I tried setting the configs with and without storage account, but they seem to only work when set on spark context and not spark session. This issue persists in v1 and v2 of this library. @nightscape I plan on looking deeper at this, but would you be aware of any reason why the library is only working with configs from spark context instead of spark session?

williamdphillips avatar Mar 29 '23 20:03 williamdphillips

@williamdphillips in this line we're passing the HadoopConfiguration to the WorkbookReader and then use it here to actually read from the filesystem. There might be another way to read from the Hadoop filesystem that also takes into account parameters that were set on the SparkSession. Could you check how this is done in other readers (e.g. Parquet, Avro, CSV)?

nightscape avatar Mar 30 '23 07:03 nightscape

@nightscape I see some usage of sparkSession.sessionState.newHadoopConfWithOptions(options) here, although I'm not sure where the best place to implement that would be.

Found some more information about this method here.

I'm thinking we can probably update the instantiation of WorkbookReader to be like this:

val conf = sqlContext.sparkSession.sessionState.newHadoopConf()
val wbReader = WorkbookReader(parameters, conf)

Created a draft PR (haven't tested yet due to some SBT issues): https://github.com/crealytics/spark-excel/pull/726

williamdphillips avatar Mar 30 '23 15:03 williamdphillips

@nightscape Realized that for reading the issue is fixed - but PR didn't address the issue when writing. Looking through spark source code and this repo's source code to try and find where to add a similar solution.

Edit - Here is PR: https://github.com/crealytics/spark-excel/pull/728

williamdphillips avatar Apr 03 '23 20:04 williamdphillips