spark-excel [BUG] org.apache.poi.util.RecordFormatException: Not enough data (0) to read requested (2) bytes

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

df = spark.read \
    .format("excel") \
    .option("header", "true") \
    .load("/path/1998 Household Trend Data1.xls")

I have included the xls file as an attachment. I think this bug has already been fixed in the latest POI library, according to https://bz.apache.org/bugzilla/show_bug.cgi?id=65543. Will you include the fix in the latest spark-excel build?

1998 Household Trend Data1.xls

Expected Behavior

spark-excel should be able to read the Excel file.

Steps To Reproduce

Run the snippet above against the attached file. Error details:

py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: org.apache.poi.util.RecordFormatException: Not enough data (0) to read requested (2) bytes
        at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:234)
        at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:253)
        at org.apache.poi.hssf.record.PrintSetupRecord.<init>(PrintSetupRecord.java:89)
        at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:95)
        at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:289)
        at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:255)
        at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:187)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:371)
        at org.apache.poi.hssf.usermodel.HSSFWorkbookFactory.create(HSSFWorkbookFactory.java:79)
        at org.apache.poi.hssf.usermodel.HSSFWorkbookFactory.create(HSSFWorkbookFactory.java:37)
        at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$3(WorkbookFactory.java:235)
        at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
        at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:235)
        at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
        at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120)
        at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137)
        at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
        at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
        at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
        at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:70)
        at scala.Option.orElse(Option.scala:447)
        at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:70)
        at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:64)
        at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
        at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
        at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:94)
        at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92)
        at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:90)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:132)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
        at scala.Option.flatMap(Option.scala:271)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:835)

Environment

- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS:
- Cluster environment

Anything else?

Jan 13 '23 05:01 cometta

Scala-steward usually takes care of updating dependencies. We're on POI 5.2.3. Should the bug be fixed there?

Jan 13 '23 09:01 nightscape

The issue was only discovered a week ago (https://bz.apache.org/bugzilla/show_bug.cgi?id=65543), so I don't think it is fixed in POI 5.2.3.

Jan 14 '23 01:01 cometta

OK. Once there's a release with the fix, Scala-steward should create a PR soon-ish.

Jan 14 '23 19:01 nightscape

https://github.com/apache/poi/actions/runs/3859374594

Jan 16 '23 20:01 gorshkov-leonid

Does anyone know the Apache POI release lifecycle, and whether this fix will be available in the short term?

May 23 '23 15:05 mtovmassian

Good question. @pjfanning I saw you made some contributions to POI. Are you aware when a new release might be made?

May 23 '23 15:05 nightscape

Does spark-excel support the ancient xls format? The POI issue only affects that dinosaur format.

I am a POI contributor, but the community is not active. I have done most of the recent releases and don't want to do more. The ASF regards over-reliance on one contributor to do releases as a big sign of community malaise.

May 23 '23 15:05 pjfanning

This issue appears to be rare (it affects very few xls files). Users who hit it can open the xls in Excel and resave it as an xlsx, which can then be loaded using spark-excel instead. There are many other tools that will open an xls and save it as xlsx if you don't have Excel licenses.
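
As a rough sketch of the resave workaround, the snippet below shells out to LibreOffice in headless mode. LibreOffice is only one possible tool and is an assumption here, not something named in this thread. The helper names `build_convert_command` and `convert_xls_to_xlsx` are hypothetical, and it has not been verified that LibreOffice can open the specific file attached to this issue.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(xls_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts one .xls to .xlsx."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", "xlsx",  # target format
        "--outdir", str(out_dir),
        str(xls_path),
    ]


def convert_xls_to_xlsx(xls_path, out_dir, soffice="soffice"):
    """Convert xls_path with LibreOffice and return the expected .xlsx path.

    Assumes a LibreOffice binary named `soffice` is on PATH.
    """
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(xls_path, out_dir, soffice), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return out_dir / (Path(xls_path).stem + ".xlsx")


# Possible usage (paths are placeholders):
# xlsx = convert_xls_to_xlsx("/path/1998 Household Trend Data1.xls", "/path/converted")
# df = spark.read.format("excel").option("header", "true").load(str(xlsx))
```

If the conversion succeeds, the resulting xlsx is parsed through POI's xlsx code path rather than the HSSF records shown in the stack trace, which is why resaving sidesteps the error.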

May 23 '23 15:05 pjfanning

@cometta @mtovmassian would it be an option to convert the file to .xlsx format? As @pjfanning mentioned, .xls is really ancient and should not be produced by current tools anymore.

May 23 '23 21:05 nightscape

Thank you @nightscape and @pjfanning for the details you gave. In fact, my use case is not related to spark-excel. I commented on this bug ticket because it is the only place I found where this Apache POI issue is discussed. I'm working with legacy software that relies on the .xls format. In the future, the .xls -> .xlsx conversion will be an option, but right now we need to look for another workaround.

May 23 '23 22:05 mtovmassian