
Saving/Reading excel file into/from Azure BLOB

Open hiimhp opened this issue 6 years ago • 17 comments

Your issue may already be reported! Please search the issue tracker before creating one.

Expected Behavior

I am trying to save/write a DataFrame to an Excel file, and also read an Excel file into a DataFrame, using Databricks.

The location of the file is Azure Blob storage.

Current Behavior

I get an error: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container container_name in account storageaccountname.blob.core.windows.net not found, and we can't create it using anonymous credentials, and no credentials found for them in the configuration.

I have connected to the blob in the notebook as per the instructions on the Databricks website: spark.conf.set("fs.azure.account.key..blob.core.windows.net", "")
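For reference, the documented pattern substitutes the real storage account name into the config key; this sketch shows the shape (the account name and key here are placeholders, not real values):

```python
# Sketch of the Databricks-documented credential setup.
# "storageaccountname" and the key value are placeholders.
storage_account = "storageaccountname"
access_key = "<access_key>"

# The config key must contain the account name between the two dots.
conf_key = f"fs.azure.account.key.{storage_account}.blob.core.windows.net"
print(conf_key)  # fs.azure.account.key.storageaccountname.blob.core.windows.net

# In a notebook with a live SparkSession you would then run:
# spark.conf.set(conf_key, access_key)
```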

Your Environment

My Databricks runtime versions are as follows:

  1. runtime 5.1
  2. Scala 2.11
  3. Spark 2.4.0

Python Version 3

hiimhp avatar Jan 20 '19 04:01 hiimhp

Can you try the same thing with another file format, e.g. CSV?

nightscape avatar Jan 21 '19 16:01 nightscape

Same issue here using PySpark. It seems that internally this library creates a new Spark session without the configs. Read and write operations fail with:

df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", options['sheetName'])
      .option("useHeader", "true")
      .option("startColumn", options['startColumn'])
      .option("maxRowsInMemory", 1)
      .option("skipFirstRows", options['skipFirstRows'])
      .load(path_file))

vaquer avatar Jan 23 '19 19:01 vaquer

@vaquer you seem to be using a pre 0.11 version according to the parameters you specify. Can you try with 0.11.0 and also check if you get the same error when trying to read a CSV?

nightscape avatar Jan 23 '19 21:01 nightscape

@nightscape Just tested it with 0.11.0; the CSV reading works properly. But when I try to read an xlsx it throws an error on the following line:

spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load(actual_filename)
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-11-ef29afafe9ea> in <module>
     16 actual_filename = blob_filename + file
     17 print(actual_filename)
---> 18 spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").load(actual_filename)

/opt/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    164         self.options(**options)
    165         if isinstance(path, basestring):
--> 166             return self._df(self._jreader.load(path))
    167         elif path is not None:
    168             if type(path) != list:

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o145.load.
: scala.MatchError: Map(useheader -> true, path -> wasbs://[email protected]/ibope/raw/201809/Test.xlsx) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
	at com.crealytics.spark.excel.DataLocator$.apply(DataLocator.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

vaquer avatar Jan 23 '19 22:01 vaquer

@vaquer yes, you're missing the dataAddress option. It should have a default value, but it seems that doesn't work... But this problem is completely unrelated to Azure.
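A sketch of passing dataAddress explicitly rather than relying on the default (the sheet name, cell, and path below are assumptions for illustration):

```python
# Build an explicit dataAddress for spark-excel 0.11+.
# Sheet name and top-left cell are assumptions.
sheet_name = "Sheet1"
data_address = f"'{sheet_name}'!A1"
print(data_address)  # 'Sheet1'!A1

# With a live SparkSession (path is a placeholder):
# df = (spark.read
#     .format("com.crealytics.spark.excel")
#     .option("dataAddress", data_address)
#     .option("useHeader", "true")
#     .load("wasbs://<container>@<account>.blob.core.windows.net/Test.xlsx"))
```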

nightscape avatar Jan 24 '19 16:01 nightscape

@vaquer @hiimhp, were you able to solve this issue? I am facing a similar error when trying to read Excel data from Azure Blob storage into Databricks. Please let me know if you were able to figure it out.

rameshkumarnunna avatar Apr 30 '19 21:04 rameshkumarnunna

Can someone help me with this? I need to ingest a source Excel file into ADLS Gen2 using ADF v2. This then has to be read by Azure DWH external tables, so converting Excel to CSV automatically is what I need.

sanjusanjeet avatar Jul 11 '19 10:07 sanjusanjeet

@rameshkumarnunna @sanjusanjeet if you want help, you need to be much more specific. Please post the exact environment you're running the code in, the code you're using with spark-excel, and the exception or whatever unexpected thing happens.

nightscape avatar Jul 11 '19 16:07 nightscape

Hi @nightscape any chance I could reopen this and ask for help?

My environment is databricks version 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11) with spark-excel installed via Maven as com.crealytics:spark-excel_2.11:0.12.0

When running the following code: spark.read.option("useHeader", "true").load("wasbs://@.blob.core.windows.net/file_name.xlsx", format = "com.crealytics.spark.excel")

I get the same error as the original poster: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container container_name in account storageaccountname.blob.core.windows.net not found, and we can't create it using anonymous credentials, and no credentials found for them in the configuration.

and I get this error loading both an xlsx and a csv file when using the .load method

I've used spark.conf.set to set my storage credentials, and note that I can correctly load the csv using spark.read.csv instead.

It would appear that .load is not using the credentials I set, whereas .csv is? Can you think of any workaround to make the credentials get used properly?

Thanks for your help! :)

Remtasya avatar Sep 05 '19 01:09 Remtasya

Hey @Remtasya, you write

and I get this error loading both a xlsx and csv file when using the .load method

and later

It would appear that .load is not using my set credentials whereas .csv is?

So it appears that the .csv method does something that .load doesn't. If you can locate the csv method in the Spark sources and how it calls .load (which it does with high probability) we could figure out the difference and what would have to be done for spark-excel.

nightscape avatar Sep 09 '19 11:09 nightscape

Hey nightscape - thanks for the response.

I'm still new to scala but here are a few potential leads I found digging around in the spark documentation:

In the test file CSVSuite.scala they run a few tests and seem to use .load alongside .format("csv"), as below. I assume this is the same as using the .csv method?

  test("simple csv test") {
    val cars = spark
      .read
      .format("csv")
      .option("header", "false")
      .load(testFile(carsFile))
    verifyCars(cars, withHeader = false, checkTypes = false)
  }

In the core file CSVScanBuilder.scala, the class extends FileScanBuilder and passes sparkSession to it (is this how it's done for crealytics spark-excel?). Perhaps this is why their .load inherits the Spark session config? Code below:

case class CSVScanBuilder(
    sparkSession: SparkSession,
    fileIndex: PartitioningAwareFileIndex,
    schema: StructType,
    dataSchema: StructType,
    options: CaseInsensitiveStringMap)
  extends FileScanBuilder(sparkSession, fileIndex, dataSchema) {

  override def build(): Scan = {
    CSVScan(sparkSession, fileIndex, dataSchema, readDataSchema(), readPartitionSchema(), options)
  }
}
Also, I noticed that the csv format supports passing a list of files to be appended together as well as a single source file, e.g. as below in the pyspark docs for csv:

        if isinstance(path, basestring):
            path = [path]
        if type(path) == list:
            return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))

Do you know if there is any way to do something similar using your module?

Remtasya avatar Sep 11 '19 06:09 Remtasya

Hey @Remtasya, thanks for digging out relevant code. Can you post the CSV code that works and the version that doesn't here? Especially how you pass any login information to the .csv method.

nightscape avatar Sep 11 '19 18:09 nightscape

Hey @nightscape, thanks for your help.

First I set my azure blob storage credentials as below:

storage_account_name = "<storage_account>"
storage_account_access_key = "<access_key>"

spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net",storage_account_access_key)

Then I load a csv from a container:

csv_example = spark.read.schema(example_schema).option("header","true").csv("wasbs://<container_name>@<storage_account>.blob.core.windows.net/csv_example_file.csv")

Trying this with crealytics doesn't work, as below:

excel_example = spark.read.format("com.crealytics.spark.excel").option("useHeader", "true").schema(example_schema).load("wasbs://<container_name>@<storage_account>.blob.core.windows.net/excel_example_file.xlsx")

It gives the same error as I get if I put in the wrong storage account with the .csv method, i.e. when it can't find a valid storage account.

As an aside, the csv method can also handle a list of file paths of the same format and will append them together. It can even do this using wildcards, i.e. if you have two files 'csv_part_1' and 'csv_part_2' you can supply it with 'csv_part_*' and it will find them both and combine them. Is there a simple way to do this with crealytics?
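The list-and-glob behavior of the built-in csv reader can be sketched like this (file names are assumptions; the Spark calls are shown commented because they need a live session):

```python
from fnmatch import fnmatch

# Hadoop-style path globs match like shell wildcards; fnmatch illustrates
# which of these hypothetical files 'csv_part_*' would select.
files = ["csv_part_1.csv", "csv_part_2.csv", "notes.txt"]
matched = [f for f in files if fnmatch(f, "csv_part_*")]
print(matched)  # ['csv_part_1.csv', 'csv_part_2.csv']

# With a live SparkSession, either form works for the built-in csv source:
# df = spark.read.option("header", "true").csv(matched)       # explicit list
# df = spark.read.option("header", "true").csv("csv_part_*")  # glob
```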

Remtasya avatar Sep 12 '19 00:09 Remtasya

@Remtasya could you open a new issue regarding the .load("csv_part_*")? This helps keep the discussion focused. Regarding Azure Blob storage: could you try whether the approach mentioned here works?

nightscape avatar Sep 12 '19 07:09 nightscape

@nightscape As per your link above, setting the credentials using the Azure cluster config seems to have resolved the problem - thanks for finding a solution!
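For anyone else hitting this: the cluster-scoped setting referenced above goes in the cluster's Spark config as a space-separated key/value line (account name and key are placeholders):

```
fs.azure.account.key.<storage_account>.blob.core.windows.net <access_key>
```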

Remtasya avatar Sep 18 '19 06:09 Remtasya

You could just mount the storage itself and easily access the file through the dbfs as well.
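A minimal sketch of the mount approach, assuming hypothetical account and container names; dbutils.fs.mount only exists inside a Databricks notebook, so the call itself is shown commented:

```python
# Build the extra_configs map for dbutils.fs.mount.
# Account and container names are assumptions, not real values.
storage_account = "mystorageaccount"
container = "mycontainer"

conf_key = f"fs.azure.account.key.{storage_account}.blob.core.windows.net"
extra_configs = {conf_key: "<access_key>"}
print(conf_key)

# Inside a Databricks notebook:
# dbutils.fs.mount(
#     source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
#     mount_point=f"/mnt/{container}",
#     extra_configs=extra_configs)
#
# Once mounted, spark-excel reads via the DBFS path, so no per-session
# credentials are needed:
# df = (spark.read.format("com.crealytics.spark.excel")
#     .option("useHeader", "true")
#     .load(f"/mnt/{container}/file_name.xlsx"))
```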

david-nguyen avatar Oct 24 '19 00:10 david-nguyen

@nightscape As per your link above, setting the credentials using the Azure cluster config seems to have resolved the problem - thanks for finding a solution!

@Remtasya Can you explain a bit more what value you set in the Spark conf, other than the storage account access key, to read/write an Excel DataFrame?

s0y03gy avatar May 13 '22 12:05 s0y03gy