
Error reading a quite big Excel file, size=300M

spongebobZ opened this issue 5 years ago • 7 comments

    Exception in thread "main" java.io.IOException: ZIP entry size is too large or invalid
        at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:43)
        at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:51)
        at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
        at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
        at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
        at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
        at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:42)
        at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:42)
        at scala.Option.fold(Option.scala:158)
        at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:42)
        at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
        at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:38)
        at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
        at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
        at com.crealytics.spark.excel.ExcelRelation.headerCells$lzycompute(ExcelRelation.scala:33)
        at com.crealytics.spark.excel.ExcelRelation.headerCells(ExcelRelation.scala:33)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:148)
        at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:147)
        at scala.Option.getOrElse(Option.scala:121)
        at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:147)
        at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:40)
        at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
        at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
        at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at test1$.main(test1.scala:12)
        at test1.main(test1.scala)

spongebobZ avatar Sep 30 '19 08:09 spongebobZ

Have you tried the maxRowsInMemory option?
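
For reference, a minimal sketch of how that option is passed (an editor's illustration, not the reporter's code: the app name and path are placeholders, and maxRowsInMemory is documented in the spark-excel README):

    import org.apache.spark.sql.SparkSession

    // maxRowsInMemory switches spark-excel to a streaming POI reader that
    // keeps only a sliding window of rows in memory instead of the whole sheet.
    val spark = SparkSession.builder().appName("excel-read").getOrCreate()
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("maxRowsInMemory", 20)      // rows held in memory at a time
      .load("/path/to/big-file.xlsx")     // placeholder path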

nightscape avatar Sep 30 '19 12:09 nightscape

It doesn't appear that using maxRowsInMemory resolves the problem. Is this therefore still an open issue?

ghoshtir avatar Oct 11 '20 09:10 ghoshtir

Hi @spongebobZ and @ghoshtir, if possible, could you please share your Excel file (after removing sensitive data)? Or the steps to generate an Excel file (column set, number of rows) that reproduces this issue? There are a number of reported issues related to big Excel files and out-of-memory errors (#79, #322, #388), so we are collecting input to see if we can figure this out. Sincerely,

quanghgx avatar Aug 24 '21 15:08 quanghgx

This can possibly be fixed by the approach described in https://stackoverflow.com/questions/46796874/java-io-ioexception-failed-to-read-zip-entry-source.

Note that spark-excel shades the POI classes: org.apache.poi is relocated to shadeio.poi, so any POI tweak has to use the shaded names (see the sketch below).
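
A minimal sketch of what that looks like, assuming the Stack Overflow approach (relaxing POI's zip-bomb safeguard) and that the shaded package mirrors org.apache.poi exactly; whether this resolves this particular error is unverified:

    // ZipSecureFile.setMinInflateRatio is a real POI static setter; its
    // shaded location here is inferred from the relocation rule above.
    import shadeio.poi.openxml4j.util.ZipSecureFile

    // Relax the minimum inflate-ratio check (POI default is 0.01) before
    // triggering the Excel read.
    ZipSecureFile.setMinInflateRatio(0.0)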

pjfanning avatar Sep 18 '21 11:09 pjfanning

Actually, this issue is more likely due to the fact that Apache POI behaves differently depending on whether it is reading from an InputStream or directly from a File. The problem only happens when reading from an InputStream: POI cannot handle files inside the xlsx ZIP that are larger than Integer.MAX_VALUE bytes (approx. 2 GB). This is a deliberate limitation, because in that mode POI does the work in memory.

It might be useful if spark-excel had a mode where it could optionally read files directly from the local file system (from java.io.File as opposed to java.io.FileInputStream).
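
To illustrate the distinction in POI terms, a sketch (editor's illustration with placeholder paths; the shaded import follows the relocation rule mentioned above, and both WorkbookFactory.create overloads are standard POI API):

    import java.io.{File, FileInputStream}
    import shadeio.poi.ss.usermodel.WorkbookFactory

    // Stream-based open: POI has to buffer each ZIP entry fully in memory,
    // so an entry larger than Integer.MAX_VALUE bytes fails, as in the
    // stack trace above.
    val fromStream = WorkbookFactory.create(new FileInputStream("big.xlsx"))

    // File-based open: POI can seek within the ZIP and read entries lazily,
    // avoiding the ~2 GB per-entry limit.
    val fromFile = WorkbookFactory.create(new File("big.xlsx"))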

pjfanning avatar Sep 18 '21 11:09 pjfanning

I've logged https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 for a possible solution

pjfanning avatar Sep 18 '21 14:09 pjfanning

Great, thank you @pjfanning! I've added you as a maintainer to this project. You and @quanghgx are doing a fantastic job here!

nightscape avatar Sep 18 '21 19:09 nightscape

Hi @pjfanning, I am facing the same issue. Could you please share the POI jar with the fix from https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?

shishir-22 avatar May 25 '23 09:05 shishir-22

@ghoshtir have a look at https://github.com/crealytics/spark-excel and this line:

    .option("tempFileThreshold", 10000000) // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead

pjfanning avatar May 25 '23 10:05 pjfanning