
Error reading a quite big Excel file, size=300M

spongebobZ opened this issue 5 years ago • 7 comments

    Exception in thread "main" java.io.IOException: ZIP entry size is too large or invalid
        at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:43)
        at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:51)
        at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
        at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
        at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:42)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:42)
        at scala.Option.fold(Option.scala:158)
        at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:42)
        at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
        at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:38)
        at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
        at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
        at com.crealytics.spark.excel.ExcelRelation.headerCells$lzycompute(ExcelRelation.scala:33)
        at com.crealytics.spark.excel.ExcelRelation.headerCells(ExcelRelation.scala:33)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:148)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:147)
        at scala.Option.getOrElse(Option.scala:121)
        at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:147)
        at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:40)
        at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
        at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
        at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at test1$.main(test1.scala:12)
        at test1.main(test1.scala)

spongebobZ avatar Sep 30 '19 08:09 spongebobZ

Have you tried the maxRowsInMemory option?
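
For reference, a minimal sketch of how that option is passed (an editor's illustration, not the reporter's code: the app name and path are placeholders, and maxRowsInMemory is documented in the spark-excel README):

    import org.apache.spark.sql.SparkSession

    // maxRowsInMemory switches spark-excel to a streaming POI reader that
    // keeps only a sliding window of rows in memory instead of the whole sheet.
    val spark = SparkSession.builder().appName("excel-read").getOrCreate()
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("maxRowsInMemory", 20)      // rows held in memory at a time
      .load("/path/to/big-file.xlsx")     // placeholder path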

nightscape avatar Sep 30 '19 12:09 nightscape

It doesn't appear that using maxRowsInMemory resolves the problem. Is this therefore still an open issue?

ghoshtir avatar Oct 11 '20 09:10 ghoshtir

Hi @spongebobZ and @ghoshtir, if possible, could you please share your Excel file (after removing sensitive data)? Or the steps to generate an Excel file (column set, number of rows) that reproduces this issue? There are a number of reported issues related to big Excel files and out-of-memory errors (#79, #322, #388), so we are collecting input to see if we can figure this out. Sincerely,

quanghgx avatar Aug 24 '21 15:08 quanghgx

This can possibly be fixed by the approach described in https://stackoverflow.com/questions/46796874/java-io-ioexception-failed-to-read-zip-entry-source.

Note that spark-excel shades the POI classes: org.apache.poi is relocated to shadeio.poi, so any POI tweak has to use the shaded names (see the sketch below).
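
A minimal sketch of what that looks like, assuming the Stack Overflow approach (relaxing POI's zip-bomb safeguard) and that the shaded package mirrors org.apache.poi exactly; whether this resolves this particular error is unverified:

    // ZipSecureFile.setMinInflateRatio is a real POI static setter; its
    // shaded location here is inferred from the relocation rule above.
    import shadeio.poi.openxml4j.util.ZipSecureFile

    // Relax the minimum inflate-ratio check (POI default is 0.01) before
    // triggering the Excel read.
    ZipSecureFile.setMinInflateRatio(0.0)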

pjfanning avatar Sep 18 '21 11:09 pjfanning

Actually, this issue is more likely due to the fact that Apache POI behaves differently depending on whether it is reading from an InputStream or directly from a File. The problem only happens when reading from an InputStream: POI cannot handle files inside the xlsx ZIP that are larger than Integer.MAX_VALUE bytes (approx. 2 GB). This is a deliberate limitation, because in that mode POI does the work in memory.

It might be useful if spark-excel had a mode where it could optionally read files directly from the local file system (from java.io.File as opposed to java.io.FileInputStream).
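
To illustrate the distinction in POI terms, a sketch (editor's illustration with placeholder paths; the shaded import follows the relocation rule mentioned above, and both WorkbookFactory.create overloads are standard POI API):

    import java.io.{File, FileInputStream}
    import shadeio.poi.ss.usermodel.WorkbookFactory

    // Stream-based open: POI has to buffer each ZIP entry fully in memory,
    // so an entry larger than Integer.MAX_VALUE bytes fails, as in the
    // stack trace above.
    val fromStream = WorkbookFactory.create(new FileInputStream("big.xlsx"))

    // File-based open: POI can seek within the ZIP and read entries lazily,
    // avoiding the ~2 GB per-entry limit.
    val fromFile = WorkbookFactory.create(new File("big.xlsx"))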

pjfanning avatar Sep 18 '21 11:09 pjfanning

I've logged https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 for a possible solution

pjfanning avatar Sep 18 '21 14:09 pjfanning

Great, thank you @pjfanning! I've added you as a maintainer to this project. You and @quanghgx are doing a fantastic job here!

nightscape avatar Sep 18 '21 19:09 nightscape

Hi @pjfanning, I am facing the same issue. Could you please share the POI jar with the fix from https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?

shishir-22 avatar May 25 '23 09:05 shishir-22

@ghoshtir have a look at https://github.com/crealytics/spark-excel and this line:

    .option("tempFileThreshold", 10000000) // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead

pjfanning avatar May 25 '23 10:05 pjfanning