spark-excel

com.crealytics.spark.excel doesn't read directly from ADL

Open Mathyaku opened this issue 5 years ago • 5 comments

I'm getting the following error:

[2019-05-29 18:25:21,894] {__init__.py:1580} ERROR - An error occurred while calling o77.load.
: java.io.IOException: Password fs.adl.oauth2.client.id not found
	at org.apache.hadoop.fs.adl.AdlFileSystem.getPasswordString(AdlFileSystem.java:950)
	at org.apache.hadoop.fs.adl.AdlFileSystem.getConfCredentialBasedTokenProvider(AdlFileSystem.java:289)

ex1 - DOESN'T WORK:

spark = sparkSession...
spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "false")
    .option("skipFirstRows", "15")
    .load("adl://test.azuredatalakestore.net/teste.xls")

PS:

If I first read any other file from my ADL with that sparkSession and then read the .xls, everything works.

ex2 - WORKS:

spark = sparkSession...
spark.read.format("csv")
    .option("useHeader", "false")
    .option("skipFirstRows", "15")
    .load("adl://test.azuredatalakestore.net/teste2.csv")

spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "false")
    .option("skipFirstRows", "15")
    .load("adl://test.azuredatalakestore.net/teste.xls")
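Put together, the workaround amounts to touching the ADL filesystem with a built-in reader first, which appears to initialize the credentials before spark-excel runs. A rough PySpark sketch (paths and options taken from the examples above; assumes a SparkSession already configured with the ADL OAuth settings):

```python
def read_excel_with_warmup(spark):
    # Workaround sketch: read any file with a built-in source first, so the
    # ADL credentials from the Spark conf get applied to the filesystem.
    spark.read.format("csv").load("adl://test.azuredatalakestore.net/teste2.csv")
    # Now the spark-excel read succeeds against the same adl:// store.
    return (spark.read.format("com.crealytics.spark.excel")
            .option("useHeader", "false")
            .option("skipFirstRows", "15")
            .load("adl://test.azuredatalakestore.net/teste.xls"))
```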

Mathyaku avatar May 29 '19 18:05 Mathyaku

Hmm, I'm quite clueless as to what we'd have to do to support ADL properly. Would you be willing to contribute a PR or dig out the corresponding documentation? We don't have this use case ourselves and can't spend much time on this...

nightscape avatar May 30 '19 19:05 nightscape

We are trying the below from Databricks; according to them, this is the update:

This happens because the Spark reader used to load the Excel file does not honor the configs given as Hadoop configuration, so it never picks them up.

Repro code below. Azure Data Lake Store (ADL) is an Azure storage platform; the problem occurs only when you reference a full adl:// path like the one below. When you mount the storage as a mount point on Databricks, the problem does not occur.

dayreportfullpath = (spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "true")
    .load("adl://aravishdatalake.azuredatalakestore.net/external/Test.xlsx"))

IllegalArgumentException Traceback (most recent call last)
<command> in <module>()
----> 1 dayreportfullpath = spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load("adl://aravishdatalake.azuredatalakestore.net/external/Test.xlsx")

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    164         self.options(**options)
    165         if isinstance(path, basestring):
--> 166             return self._df(self._jreader.load(path))
    167         elif path is not None:
    168             if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81         return deco

IllegalArgumentException: 'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
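Since the full-path error does not occur with a mount point, a rough sketch of mounting ADLS Gen1 on Databricks follows. All service-principal values and the mount point are placeholders, and `dbutils` exists only on a Databricks cluster:

```python
# Sketch: mount ADLS Gen1 on Databricks so spark-excel can use a
# dbfs:/mnt/... path instead of a full adl:// URL.
# Every credential value below is a placeholder.
adl_mount_configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "<application-id>",
    "dfs.adls.oauth2.credential": "<client-secret>",
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

def mount_adl(dbutils):
    # Only callable on Databricks, where `dbutils` is predefined in notebooks.
    dbutils.fs.mount(
        source="adl://aravishdatalake.azuredatalakestore.net/",
        mount_point="/mnt/datalake",
        extra_configs=adl_mount_configs,
    )
```

After mounting, the Excel file can be loaded via "/mnt/datalake/external/Test.xlsx".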

aravish avatar Jul 10 '19 14:07 aravish

For anyone else having this issue, you need to use the RDD context (i.e. the SparkContext's Hadoop configuration). You can also mount the storage, but in some cases you may be averse to mounting (as in my use case).

spark.sparkContext.hadoopConfiguration.set(...

This is what worked for me earlier today. Was able to read from ADLS without mounting.
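A sketch of that approach in PySpark. The key names are the standard Hadoop ADLS Gen1 OAuth settings (the missing `fs.adl.oauth2.client.id` appears in the original error); all credential values are placeholders:

```python
# Sketch: set the ADLS Gen1 OAuth credentials directly on the Hadoop
# configuration of the SparkContext, so spark-excel's file access sees them.
# Credential values are placeholders.
adls_gen1_conf = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": "<client-secret>",
    "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

def apply_hadoop_conf(spark, conf):
    # In PySpark, the Hadoop configuration is reached through the JVM gateway.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in conf.items():
        hadoop_conf.set(key, value)
```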

brickfrog avatar Oct 18 '19 00:10 brickfrog

@brickfrog mounting does not work for me; I still get the mentioned error. Can you provide your code as an example? Also, what do you mean by "use the RDD context"? Can you provide an example?

axen22 avatar Oct 23 '19 16:10 axen22

For anyone who comes here looking for a PySpark solution: on Spark 3.1.2, spark-excel cannot read an abfss:// URL out of the box.

Use com.crealytics:spark-excel_2.12:0.13.7 and set the Azure OAuth parameters with spark._jsc.hadoopConfiguration().set(key, value) in addition to spark.conf.set(key, value)
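A sketch of that dual configuration for an abfss:// (ADLS Gen2) account. The `fs.azure.account.*` key names are the standard ABFS OAuth settings; the account name and all credential values are placeholders:

```python
# Sketch: ABFS OAuth settings for ADLS Gen2. "<account>" and all credential
# values are placeholders for your storage account and service principal.
abfs_oauth_conf = {
    "fs.azure.account.auth.type.<account>.dfs.core.windows.net": "OAuth",
    "fs.azure.account.oauth.provider.type.<account>.dfs.core.windows.net":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id.<account>.dfs.core.windows.net": "<application-id>",
    "fs.azure.account.oauth2.client.secret.<account>.dfs.core.windows.net": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint.<account>.dfs.core.windows.net":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

def apply_both(spark, conf):
    # Set each key on the session conf AND on the Hadoop conf: spark-excel
    # goes through the Hadoop FileSystem API, which reads only the latter.
    for key, value in conf.items():
        spark.conf.set(key, value)
        spark._jsc.hadoopConfiguration().set(key, value)
```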

@brickfrog - thanks for pointing us in the right direction.

divyavanmahajan avatar May 16 '22 14:05 divyavanmahajan