[BUG] error java.lang.NoClassDefFoundError when trying to load specific sheet from excel file

Open chmpsymp opened this issue 2 years ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Hi,

I have been trying to read a specific sheet from an Excel file stored in Delta Lake on Databricks, but I get the following error:

Py4JJavaError: An error occurred while calling o503.load.
: java.lang.NoClassDefFoundError: shadeio/poi/schemas/vmldrawing/XmlDocument
    at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.read(XSSFVMLDrawing.java:135)
    at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.<init>(XSSFVMLDrawing.java:123)
    at shadeio.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
    at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
    at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:678)
    at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:259)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
    at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:444)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:400)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:400)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

The Excel file contains pictures, but not in the sheet that I am trying to read. If I create a new Excel file containing only the sheet that I need, the problem does not occur.

Is it possible that you can help me figure out the issue? Thanks!

Expected Behavior

The expected behavior is that the specific sheet is loaded and a DataFrame is created.

Steps To Reproduce

df = spark.read.format("com.crealytics.spark.excel").option("dataAddress", "''!A1:EF1000").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "false").load()

Unfortunately I do not have access to the example file.

Environment

- Spark version: 3.2.1 language: Python
- Spark-Excel version: 0.17.1
- OS: Windows 11
- Cluster environment: Databricks runtime version 10.5, 32GB memory, 4 cores

Anything else?

No response

chmpsymp avatar Jun 13 '22 08:06 chmpsymp

Probably related to https://github.com/crealytics/spark-excel/issues/457#issuecomment-984164190 which might be solved by https://github.com/crealytics/spark-excel/pull/597

nightscape avatar Jun 15 '22 07:06 nightscape

@nightscape thank you for your comment. Will this be available in the next release, or when should I test if #597 solves the problem?

chmpsymp avatar Jun 17 '22 08:06 chmpsymp

@chmpsymp 0.17.2 should hopefully fix this. Please give it a try and post your results here. I'll close this ticket in the meantime.

nightscape avatar Aug 20 '22 00:08 nightscape

@nightscape thanks for letting me know.

I have tested again with version "com.crealytics:spark-excel_2.12:3.2.2_0.17.2" but unfortunately I still get an error similar to the one above.

Py4JJavaError: An error occurred while calling o817.load.
: java.lang.NoClassDefFoundError: shadeio/poi/schemas/vmldrawing/XmlDocument
    at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.read(XSSFVMLDrawing.java:135)
    at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.<init>(XSSFVMLDrawing.java:123)
    at shadeio.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
    at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
    at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:678)
    at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:259)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
    at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:102)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:102)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:33)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:32)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:87)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:52)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:368)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:324)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:324)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
    at sun.reflect.GeneratedMethodAccessor775.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:306)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
    at java.lang.Thread.run(Thread.java:748)

chmpsymp avatar Aug 29 '22 09:08 chmpsymp

0.18.0-beta2 fixes this

pjfanning avatar Aug 29 '22 09:08 pjfanning

@pjfanning this works for me. Thank you!

chmpsymp avatar Aug 29 '22 09:08 chmpsymp
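
For readers who hit the same error, the resolution in this thread amounts to swapping the spark-excel version and re-running the same read. The sketch below is only an illustration under stated assumptions: `spark_excel_coordinate` and `read_sheet` are hypothetical helpers, the coordinate pattern is copied from the `com.crealytics:spark-excel_2.12:3.2.2_0.17.2` string quoted above, and it is an assumption that the 0.18.0-beta2 build is published under the same pattern for your Spark and Scala versions. The `path` and `data_address` arguments are placeholders, since the original report does not include them.

```python
def spark_excel_coordinate(plugin_version, spark_version="3.2.2", scala_version="2.12"):
    """Build a Maven coordinate in the style quoted in this thread.

    Assumption: the target release follows the same
    <spark version>_<plugin version> naming as the 0.17.2 coordinate above.
    """
    return f"com.crealytics:spark-excel_{scala_version}:{spark_version}_{plugin_version}"


def read_sheet(spark, path, data_address):
    """Read one sheet with the same reader options as the original report.

    `spark` is an existing SparkSession with the spark-excel library attached;
    `path` and `data_address` (for example "'SheetName'!A1:EF1000") are
    placeholders supplied by the caller.
    """
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("dataAddress", data_address)
        .option("header", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "false")
        .load(path)
    )


if __name__ == "__main__":
    # Coordinate to attach to the cluster in place of the 0.17.x one.
    print(spark_excel_coordinate("0.18.0-beta2"))
```

Keeping the reader options unchanged while only the library version varies makes it easier to confirm that the version bump, and nothing else, removes the `NoClassDefFoundError`.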