
.option("mode", "FAILFAST") is not working

Open girishjambkar opened this issue 2 years ago • 13 comments

Steps to Reproduce (for bugs)

Excel file to load: [screenshot of the sample worksheet]

Code to load the file:

```scala
%scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val myschema = StructType(Array(
  StructField("Processo", StringType, nullable = false),
  StructField("Data", TimestampType, nullable = false),
  StructField("Balcao", StringType, nullable = false),
  StructField("CriadoPor", StringType, nullable = false)
))

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "teste!A1")
  .option("mode", "FAILFAST")
  .option("header", true)
  .schema(myschema)
  .load("/mnt/raw/externalfiles/TESTE.xlsx")

display(df)
```

The code should fail if I pass a wrong/invalid datetime value in the `Data` column.

df.show()

Even after calling an action like this, no exception is thrown.

Can you confirm whether crealytics supports `.option("mode", <mode>)`, e.g. `.option("mode", "FAILFAST")`?

Your Environment

Databricks Runtime Version: 10.1 (includes Apache Spark 3.2.0, Scala 2.12)
Libraries: com.crealytics:spark-excel_2.12:0.13.1

girishjambkar avatar Dec 29 '21 12:12 girishjambkar

@girishjambkar the issue template is supposed to be filled out 😉

nightscape avatar Dec 29 '21 22:12 nightscape

> @girishjambkar the issue template is supposed to be filled out 😉

I have filled it out now, thanks! Can you please confirm whether crealytics supports `.option("mode", <mode>)` like the CSV reader does?

girishjambkar avatar Dec 30 '21 09:12 girishjambkar

Version 1 of the data source, which you are using by specifying

.format("com.crealytics.spark.excel")

does not support this.

Try

.format("excel")

nightscape avatar Dec 30 '21 19:12 nightscape

File "/usr/lib/spark/python/lib/py4j-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o142.load. : java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassNotFoundException: excel.DefaultSource at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634) at scala.util.Try.orElse(Try.scala:84) at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634) ... 13 more

girishjambkar avatar Dec 31 '21 04:12 girishjambkar

@nightscape I am unable to use `.format("excel")`; I get the exception above.

girishjambkar avatar Dec 31 '21 04:12 girishjambkar

You're using an outdated version, please upgrade.
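For instance, with sbt (the version string below is an assumption; pick the release matching your Spark version, and make sure the Scala suffix matches the cluster's Scala version — DBR 10.1 runs Scala 2.12):

```scala
// build.sbt -- %% appends the project's Scala binary version (_2.12 here),
// so it must agree with the cluster runtime. The version string is assumed;
// check the spark-excel releases for the exact coordinate.
libraryDependencies += "com.crealytics" %% "spark-excel" % "3.2.0_0.16.0"
```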

nightscape avatar Dec 31 '21 13:12 nightscape

[screenshot of the error]

I have tried running with `.format("excel")`. It is failing with the above error. @nightscape

Databricks Runtime Version: 10.1 (includes Apache Spark 3.2.0, Scala 2.12)
Libraries: com.crealytics:spark-excel_2.13:3.2.0_0.16.0

Please help.

girishjambkar avatar Dec 31 '21 16:12 girishjambkar

Could you expand the exception and post it here?

nightscape avatar Jan 01 '22 21:01 nightscape

Error trace:

```
NoClassDefFoundError: shadeio/commons/io/output/UnsynchronizedByteArrayOutputStream
Caused by: ClassNotFoundException: shadeio.commons.io.output.UnsynchronizedByteArrayOutputStream
  at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
  at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
  at scala.Option.fold(Option.scala:251)
  at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
  at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
  at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
  at scala.Option.getOrElse(Option.scala:189)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:390)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:346)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:346)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:3)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:47)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:49)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw.<init>(command-2541019815824441:51)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw.<init>(command-2541019815824441:53)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw.<init>(command-2541019815824441:55)
  at $line03e3e0503061413eab90de3bf6be643427.$read.<init>(command-2541019815824441:57)
  at $line03e3e0503061413eab90de3bf6be643427.$read$.<init>(command-2541019815824441:61)
  at $line03e3e0503061413eab90de3bf6be643427.$read$.<clinit>(command-2541019815824441)
  at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print$lzycompute(:7)
  at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print(:6)
  at $line03e3e0503061413eab90de3bf6be643427.$eval.$print()
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
  at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
  at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
  at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
  at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
  at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
  at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
  at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
  at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:235)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:902)
  at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:855)
  at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:235)
  at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:541)
  at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
  at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
  at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
  at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
  at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
  at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
  at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:518)
  at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
  at scala.util.Try$.apply(Try.scala:213)
  at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
  at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
  at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
  at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
  at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
  at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: shadeio.commons.io.output.UnsynchronizedByteArrayOutputStream
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
  at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
  at scala.Option.fold(Option.scala:251)
  at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
  at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
  at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
  at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
  at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
  at scala.Option.getOrElse(Option.scala:189)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:390)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:346)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:346)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:3)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:47)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw$$iw.<init>(command-2541019815824441:49)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw$$iw.<init>(command-2541019815824441:51)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw$$iw.<init>(command-2541019815824441:53)
  at $line03e3e0503061413eab90de3bf6be643427.$read$$iw.<init>(command-2541019815824441:55)
  at $line03e3e0503061413eab90de3bf6be643427.$read.<init>(command-2541019815824441:57)
  at $line03e3e0503061413eab90de3bf6be643427.$read$.<init>(command-2541019815824441:61)
  at $line03e3e0503061413eab90de3bf6be643427.$read$.<clinit>(command-2541019815824441)
  at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print$lzycompute(:7)
  at $line03e3e0503061413eab90de3bf6be643427.$eval$.$print(:6)
  at $line03e3e0503061413eab90de3bf6be643427.$eval.$print()
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747)
  at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020)
  at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568)
  at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
  at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
  at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
  at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564)
  at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
  at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:235)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:902)
  at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:855)
  at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:235)
  at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:541)
  at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
  at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
  at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
  at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
  at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
  at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
  at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:518)
  at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
  at scala.util.Try$.apply(Try.scala:213)
  at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
  at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
  at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
  at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
  at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
  at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
  at java.lang.Thread.run(Thread.java:748)
```

girishjambkar avatar Jan 04 '22 03:01 girishjambkar

@nightscape I am getting the above exception in a Databricks notebook; meanwhile, I can try the same on an AWS EMR cluster. Can you please confirm whether crealytics supports `.option("mode", <mode>)` in the upgraded version with `.format("excel")`? The official documentation (https://github.com/crealytics/spark-excel) doesn't mention anything like `.option("mode", <mode>)`. It would also be great if the official documentation had some examples/samples to test this.

I have seen similar problems when reading with a custom schema using crealytics: https://stackoverflow.com/questions/70540400/spark-read-excel-not-reading-all-excel-rows-when-using-custom-schema/70565795#70565795

girishjambkar avatar Jan 04 '22 04:01 girishjambkar

Hi @girishjambkar and @nightscape, let me check the support for `.option("mode", <MODE>)` and get back to you. This looks like an issue in the V2 implementation.
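Once that lands, a quick way to verify it could look like this (a sketch reusing the schema and hypothetical path from the issue; it assumes FAILFAST surfaces as an exception on the first action):

```scala
// Hypothetical check: a malformed timestamp in the Data column should
// abort the read under FAILFAST; if count() succeeds, mode was ignored.
import scala.util.{Failure, Success, Try}

val attempt = Try {
  spark.read
    .format("excel")
    .option("dataAddress", "teste!A1")
    .option("header", true)
    .option("mode", "FAILFAST")
    .schema(myschema)
    .load("/mnt/raw/externalfiles/TESTE.xlsx") // file with one bad Data cell
    .count()                                   // force an action
}

attempt match {
  case Failure(e) => println(s"FAILFAST triggered as expected: ${e.getMessage}")
  case Success(n) => println(s"No error raised; read $n rows, so mode was likely ignored")
}
```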

quanghgx avatar Jan 10 '22 12:01 quanghgx

Hi @quanghgx, I also get this issue running an Azure Databricks cluster 9.1 on Apache Spark 3.1.2 with com.crealytics:spark-excel_2.12:3.1.2_0.16.1-pre1:

[screenshot of the error]

alexjbush avatar Jan 27 '22 20:01 alexjbush

0.16.1-pre1 is unfortunately broken. I tried shading dependencies differently, but that made things worse. 0.16.1-pre2 would have been another shot at this, but the build is failing for a reason I don't understand yet. For now, please try 0.16.0.
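If it helps while debugging, one way to confirm what is actually on the driver classpath before reading (a sketch; the V2 class name is an assumption based on the project layout and may differ between releases):

```scala
// Probe for the shaded commons-io class that 0.16.1-pre1 fails to find,
// and for the assumed V2 entry point. Both class names are assumptions.
import scala.util.Try

Seq(
  "shadeio.commons.io.output.UnsynchronizedByteArrayOutputStream", // missing in pre1
  "com.crealytics.spark.excel.v2.ExcelDataSource"                  // assumed V2 class
).foreach { name =>
  val found = Try(Class.forName(name)).isSuccess
  println(s"$name -> ${if (found) "present" else "MISSING"}")
}
```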

nightscape avatar Jan 28 '22 21:01 nightscape