Open · spongebobZ opened this issue 5 years ago · 7 comments
Exception in thread "main" java.io.IOException: ZIP entry size is too large or invalid
at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:43)
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:51)
at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:42)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:42)
at scala.Option.fold(Option.scala:158)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:42)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:38)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.headerCells$lzycompute(ExcelRelation.scala:33)
at com.crealytics.spark.excel.ExcelRelation.headerCells(ExcelRelation.scala:33)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:148)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:147)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:147)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:40)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at test1$.main(test1.scala:12)
at test1.main(test1.scala)
Hi @spongebobZ and @ghoshtir
If possible, could you please share your Excel file (after removing sensitive data)? Or the steps to generate (column set, number of rows) an Excel file that reproduces this issue?
There are a number of reported issues related to large Excel files and out-of-memory errors, so we are trying to collect input to see if we can figure this out. #79 #322 #388
Sincerely,
Actually, this issue is more likely due to the fact that Apache POI reads files differently depending on whether it is reading from an InputStream or directly from a File. The issue only happens when reading from an InputStream: POI cannot handle files inside the xlsx zip that are larger than Integer.MAX_VALUE bytes (approx. 2 GB). This is a deliberate limitation, because in stream mode POI does the work in memory.
It might be useful if spark-excel had a mode where it could optionally read files directly from the local file system (from a java.io.File as opposed to a java.io.FileInputStream).
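For illustration, here is a minimal sketch of the difference described above, calling POI's WorkbookFactory directly (the file name large.xlsx is a placeholder, and the imports assume unshaded POI rather than the shadeio relocation in the stack trace):

```scala
import java.io.{File, FileInputStream}
import org.apache.poi.ss.usermodel.WorkbookFactory

// Reading from an InputStream makes POI buffer each zip entry in memory,
// so any entry inside the xlsx larger than Integer.MAX_VALUE bytes fails
// with the "ZIP entry size is too large or invalid" IOException above.
val fromStream = WorkbookFactory.create(new FileInputStream("large.xlsx"))

// Reading from a File lets POI open the package with random access instead
// of streaming, which avoids the in-memory per-entry size limit.
val fromFile = WorkbookFactory.create(new File("large.xlsx"))
```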
Hi @pjfanning, I am facing the same issue. Could you please share the POI jar with the fix from https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?
@ghoshtir have a look at https://github.com/crealytics/spark-excel and this option:
.option("tempFileThreshold", 10000000) // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead