spark-excel
Saving/Reading excel file into/from Azure BLOB
Expected Behavior
I am trying to write a DataFrame to an Excel file, and also read an Excel file into a DataFrame, on Databricks.
The file is located in Azure Blob Storage.
Current Behavior
I get an error: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container container_name in account storageaccountname.blob.core.windows.net not found, and we can't create it using anonymous credentials, and no credentials found for them in the configuration.
I have connected to the blob in the notebook as per the instructions on the Databricks website:
spark.conf.set(
"fs.azure.account.key.
Your Environment
My Databricks environment is as follows:
- Runtime 5.1
- Scala 2.11
- Spark 2.4.0
- Python version 3
Can you try the same thing with another file format, e.g. CSV?
Same issue here using PySpark; it seems that internally this library creates a new Spark session without the configs. Read and write operations fail with:
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("sheetName", options['sheetName'])
    .option("useHeader", "true")
    .option("startColumn", options['startColumn'])
    .option("maxRowsInMemory", 1)
    .option("skipFirstRows", options['skipFirstRows'])
    .load(path_file))
@vaquer you seem to be using a pre 0.11 version according to the parameters you specify. Can you try with 0.11.0 and also check if you get the same error when trying to read a CSV?
@nightscape Just tested it with 0.11.0; the CSV reading works properly. But when I try to read an xlsx, it throws an error on the following line:
spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load(actual_filename)
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-11-ef29afafe9ea> in <module>
16 actual_filename = blob_filename + file
17 print(actual_filename)
---> 18 spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load(actual_filename)
/opt/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
164 self.options(**options)
165 if isinstance(path, basestring):
--> 166 return self._df(self._jreader.load(path))
167 elif path is not None:
168 if type(path) != list:
/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o145.load.
: scala.MatchError: Map(useheader -> true, path -> wasbs://[email protected]/ibope/raw/201809/Test.xlsx) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
at com.crealytics.spark.excel.DataLocator$.apply(DataLocator.scala:52)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
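The MatchError above shows the option key as lowercase useheader because Spark wraps data source options in a CaseInsensitiveMap before handing them to the data source. A minimal Python sketch of that lookup behavior (this class is an illustrative stand-in, not Spark's actual implementation):

```python
class CaseInsensitiveMap:
    """Illustrative stand-in for Spark's CaseInsensitiveMap:
    keys are stored and looked up in lowercase."""

    def __init__(self, options):
        # Normalize all keys to lowercase, as Spark does internally.
        self._data = {k.lower(): v for k, v in options.items()}

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __contains__(self, key):
        return key.lower() in self._data


# Options passed as "useHeader" surface as "useheader" in error messages.
opts = CaseInsensitiveMap({"useHeader": "true", "path": "wasbs://..."})
print("useheader" in opts)   # the lowercased key matches
print(opts["USEHEADER"])     # lookups are case-insensitive too
```

This is why the error message prints Map(useheader -> true, ...) even though the caller wrote "useHeader".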
@vaquer yes, you're missing the dataAddress option. It should have a default value, but it seems that doesn't work...
But this problem is completely unrelated to Azure.
@vaquer @hiimhp, were you able to solve this issue? I am facing a similar error when trying to read Excel data from Azure Blob Storage into Databricks. Please let me know if you were able to figure it out.
Can someone help me with this? I need to ingest a source Excel file into ADLS Gen2 using ADF v2. This then has to be read by Azure DWH external tables, so automatically converting Excel to CSV is what I need.
@rameshkumarnunna @sanjusanjeet if you want help, you need to be much more specific. Please post the exact environment you're running the code in, the code you're using to apply spark-excel, the exception or whatever unexpected thing happens.
Hi @nightscape any chance I could reopen this and ask for help?
My environment is databricks version 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11) with spark-excel installed via Maven as com.crealytics:spark-excel_2.11:0.12.0
When running the following code:
spark.read.option("useHeader", "true").load("wasbs://
I get the same error as the original poster: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container container_name in account storageaccountname.blob.core.windows.net not found, and we can't create it using anonymous credentials, and no credentials found for them in the configuration.
and I get this error loading both an xlsx and a csv file when using the .load method
I've used spark.conf.set to set my storage credentials, and note that I can correctly load the csv using spark.read.csv instead.
It would appear that .load is not using my set credentials whereas .csv is? Can you think of any work-around to make the credentials get used properly?
Thanks for your help! :)
Hey @Remtasya, you write
and I get this error loading both an xlsx and a csv file when using the .load method
and later
It would appear that .load is not using my set credentials whereas .csv is?
So it appears that the .csv method does something that .load doesn't.
If you can locate the csv method in the Spark sources and how it calls .load (which it does with high probability) we could figure out the difference and what would have to be done for spark-excel.
Hey nightscape - thanks for the response.
I'm still new to scala but here are a few potential leads I found digging around in the spark documentation:
In the test file CSVSuite.scala they run a few tests and seem to use .load alongside .format("csv"), as below. I assume this is the same as using the .csv method?
test("simple csv test") {
  val cars = spark
    .read
    .format("csv")
    .option("header", "false")
    .load(testFile(carsFile))
  verifyCars(cars, withHeader = false, checkTypes = false)
}
In the core file CSVScanBuilder.scala they extend another class, FileScanBuilder, and pass sparkSession to it (is this how it's done in crealytics spark-excel?). Perhaps this is why their .load inherits the Spark session config? Code below:
case class CSVScanBuilder(
    sparkSession: SparkSession,
    fileIndex: PartitioningAwareFileIndex,
    schema: StructType,
    dataSchema: StructType,
    options: CaseInsensitiveStringMap)
  extends FileScanBuilder(sparkSession, fileIndex, dataSchema) {

  override def build(): Scan = {
    CSVScan(sparkSession, fileIndex, dataSchema, readDataSchema(), readPartitionSchema(), options)
  }
}
Also, I noticed that the csv format supports passing a list of files to be appended together as well as a single source file, e.g. as below in the pyspark docs for csv:
if isinstance(path, basestring):
    path = [path]
if type(path) == list:
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
Do you know if there is any way to do something similar using your module?
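The PySpark excerpt above simply normalizes a single path into a list before handing it to the JVM reader. A standalone sketch of that normalization logic (plain Python, no Spark; the function name is mine):

```python
def normalize_paths(path):
    """Mirror the pyspark readwriter logic: accept either a single
    path string or a list of paths, and always return a list."""
    if isinstance(path, str):
        path = [path]
    if not isinstance(path, list):
        raise TypeError("path must be a string or a list of strings")
    return path


# A single path and a list of paths both come out as lists.
print(normalize_paths("data/part1.csv"))
print(normalize_paths(["part1.csv", "part2.csv"]))
```

So supporting multiple input files is mostly a matter of the data source accepting a sequence of paths on the JVM side.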
Hey @Remtasya, thanks for digging out relevant code.
Can you post the CSV code that works and the version that doesn't here?
Especially how you pass any login information to the .csv method.
Hey @nightscape, thanks for your help.
First I set my azure blob storage credentials as below:
storage_account_name = <storage_account>
storage_account_access_key = <access_key>
spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net",storage_account_access_key)
Then I load a csv from a container:
csv_example = spark.read.schema(example_schema).option("header","true").csv("wasbs://<container_name>@<storage_account>.blob.core.windows.net/csv_example_file.csv")
Trying this with crealytics doesn't work, as below:
excel_example = spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").schema(example_schema).load("wasbs://<container_name>@<storage_account>.blob.core.windows.net/excel_example_file.xlsx")
It gives the same error as I get if I put in the wrong storage account with the .csv method, i.e. when it can't find a valid storage account.
As an aside, the csv method can also handle a list of file paths of the same format and will append them together. It can even do this using wildcards, i.e. if you have two files 'csv_part_1' and 'csv_part_2' you can supply it with 'csv_part_*' and it will find them both and combine them. Is there a simple way to do this with crealytics?
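Spark's file sources resolve such patterns with Hadoop-style path globbing, which behaves much like shell wildcards. A small illustration of the matching using Python's stdlib fnmatch (an analogy for how the pattern selects files, not the code Spark actually uses):

```python
from fnmatch import fnmatch

# Hypothetical directory listing.
files = ["csv_part_1.csv", "csv_part_2.csv", "other.csv"]

# 'csv_part_*' matches both parts but not unrelated files,
# which is how a wildcard load picks up multiple inputs.
matched = [f for f in files if fnmatch(f, "csv_part_*")]
print(matched)  # ['csv_part_1.csv', 'csv_part_2.csv']
```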
@Remtasya could you open a new issue regarding the .load("csv_part_*")? This helps keep the discussion focused.
Regarding Azure blob storage: Could you try if the approach mentioned here works?
@nightscape As per your link above, setting the credentials using the Azure cluster config seems to have resolved the problem - thanks for finding a solution!
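For anyone else hitting the same error: the workaround was to put the storage key into the cluster-level Spark config (in Databricks, under the cluster's Spark Config section) rather than setting it per notebook with spark.conf.set. The entry looks roughly like this (the placeholder names are assumptions; substitute your own account and key):

```
fs.azure.account.key.<storage_account>.blob.core.windows.net <access_key>
```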
You could also just mount the storage itself and easily access the file through DBFS.
@Remtasya Can you explain a bit more what value you set in the Spark conf, other than the storage account access key, to read/write an Excel DataFrame?