
GC overhead while ingesting big excel files

Open amazoyer opened this issue 3 years ago • 5 comments

Hi,

We are getting the following stack trace when ingesting big Excel files (over 10,000 rows):

java.lang.OutOfMemoryError: GC overhead limit exceeded
                at java.util.TreeMap.put(TreeMap.java:577)
                at shadeio.poi.xssf.usermodel.XSSFRow.<init>(XSSFRow.java:80)
                at shadeio.poi.xssf.usermodel.XSSFSheet.initRows(XSSFSheet.java:268)
                at shadeio.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:231)
                at shadeio.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:218)
                at shadeio.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:454)
                at shadeio.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:419)
                at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
                at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:288)
                at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:97)
                at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:147)
                at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory$$Lambda$36/1626927534.apply(Unknown Source)
                at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:256)
                at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:221)
                at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:49)
                at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:49)
                at scala.Option.fold(Option.scala:158)
                at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
                at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
                at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
                at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
                at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
                at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
                at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
                at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:172)
                at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:171)
                at scala.Option.getOrElse(Option.scala:121)
                at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
                at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
                at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
                at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
                at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)

We are using version 0.13.7 of the spark-excel_2.11 artifact.
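For reference, the read is a plain v1 call roughly of this shape (the path and option values below are simplified placeholders, not our actual job):

    // Simplified placeholder: without maxRowsInMemory the whole workbook is
    // opened through XSSFWorkbook, which is where the GC overhead error occurs.
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/big_file.xlsx")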

Is there something we can do to deal with this issue ?

Thanks,

amazoyer avatar Oct 08 '21 13:10 amazoyer

Hi @amazoyer, is it possible for you to share an Excel file (with sensitive data removed, or a generated one) that can reproduce this issue?

On another note, spark-excel has a number of reported tickets about large-file/OOM issues: https://github.com/crealytics/spark-excel/labels/large

So, if you can help reproduce this issue, that would be a great help.

PS: I did test Excel files (both .xls and .xlsx; will update the wiki soon) by generating 100k rows of data with spark-excel (API v2, 0.14.0). You need to set appropriate Spark configuration for executor/driver memory, but spark-excel consumes no more memory than Apache POI itself for the given test files.
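As an illustration, memory can be raised along these lines (the 8g values are placeholders to adjust for your data and cluster; driver memory generally has to be set before the driver JVM starts, i.e. via spark-submit):

    // Placeholder values; adjust for your data and cluster.
    // Driver memory usually must be passed up front, e.g.
    //   spark-submit --driver-memory 8g --executor-memory 8g ...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-excel-large-file-test")
      .config("spark.executor.memory", "8g")   // memory per executor
      .getOrCreate()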

quanghgx avatar Oct 08 '21 19:10 quanghgx

@amazoyer, you could try setting "maxRowsInMemory"; it is gentler on the garbage collector.

I think this happens when the file not only has a lot of rows but also a lot of columns. The default reader creates a temporary entity for each cell, and this unnecessarily stresses the garbage collector.
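For illustration (path and values are placeholders, not a specific job), setting maxRowsInMemory switches to the streaming reader, so only a window of rows is kept in memory at a time:

    // Placeholder path and values; maxRowsInMemory enables the streaming reader.
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("maxRowsInMemory", 1000)   // rows held in memory at a time
      .load("/path/to/big_file.xlsx")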

EnverOsmanov avatar Nov 03 '21 00:11 EnverOsmanov

I was testing with a 1GB Excel file containing 10 similarly generated sheets: 12 columns, each column holding random strings of equal length, and just enough rows to reach the desired workbook size (with an equal number of rows in each sheet).

In my case, I was using API v1 and it seems like this line:

val result = rows.to[Vector]

in ExcelRelation.scala (https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/ExcelRelation.scala#L76) loads all rows from the sheet into memory. It doesn't fail in my case, but it significantly reduces performance: even if I only need a few rows, the entire set is still loaded into memory, even with maxRowsInMemory set to a small value.
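A toy sketch of the difference (not the ExcelRelation code itself): converting an iterator to a Vector walks every element even when only a few are needed, while a lazy take stops early:

    // Toy sketch, not spark-excel code.
    def rowIterator: Iterator[String] = Iterator.tabulate(1000000)(i => s"row-$i")

    // Eager: all one million rows are built and held before take(5).
    val eager = rowIterator.toVector.take(5)

    // Lazy: only the first five rows are ever produced.
    val lazily = rowIterator.take(5).toVector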

kvirund avatar Mar 23 '22 18:03 kvirund

@kvirund correct. At some point the data has to be put into an RDD, and SparkContext, afaik, only provides the parallelize method, which takes a Seq. There are other ways to do this, e.g. the way the KafkaBatch class does it, but I've never had the need to read such huge files. If anybody is willing to give it a try, we're open to PRs.
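A minimal sketch of that constraint (illustrative data, not spark-excel code): parallelize needs the rows as an already materialized Seq on the driver, so for a large workbook the whole sheet sits in driver memory before the RDD exists:

    // Illustrative sketch of the parallelize limitation, not spark-excel code.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallelize-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // parallelize takes a Seq, so every row must already be in driver memory.
    val allRows: Seq[Seq[String]] = Vector.tabulate(100000)(i => Vector(s"a$i", s"b$i", s"c$i"))
    val rdd = sc.parallelize(allRows)
    println(rdd.count())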

nightscape avatar Mar 23 '22 19:03 nightscape

https://github.com/crealytics/spark-excel/compare/master...kvirund:patch-1

This works for me. I am still running performance tests, but reading that 1GB file went through just fine.

kvirund avatar Mar 23 '22 20:03 kvirund