
[Read an Excel File]: GC overhead limit exceeded

Open cedtnee opened this issue 4 years ago • 14 comments

Expected Behavior

I am trying to read an Excel file of about 35 MB and write the result as ORC files. The input file has one sheet, which contains only values, no formulas or macros.

Current Behavior

In my readerExcel.scala, I do this:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true") // Required
  .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
  .option("inferSchema", "false") // Optional, default: false
  .option("sheetName", "Feuil1")
  // Optional, default: either the inferred schema, or all columns as Strings.
  // `fields` is the list of column names, defined elsewhere in readerExcel.scala.
  .schema(StructType(
    fields.map(fieldName => StructField(fieldName, StringType, nullable = true))))
  .load("file.xlsx")

I have this error:

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Pattern.matcher(Pattern.java:1093)
    at shadeio.poi.ss.util.CellReference.separateRefParts(CellReference.java:396)
    at shadeio.poi.ss.util.CellReference.<init>(CellReference.java:113)
    at shadeio.poi.xssf.usermodel.XSSFCell.<init>(XSSFCell.java:118)
    at shadeio.poi.xssf.usermodel.XSSFRow.<init>(XSSFRow.java:75)
    at shadeio.poi.xssf.usermodel.XSSFSheet.initRows(XSSFSheet.java:268)
    at shadeio.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:231)
    at shadeio.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:218)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:448)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:413)
    at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:282)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:88)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:135)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
    at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
    at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
    at scala.Option.fold(Option.scala:158)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)

2020-11-26 21:36:57,729 [Driver] INFO org.apache.spark.deploy.yarn.ApplicationMaster - Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: GC overhead limit exceeded
    at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:351)
    at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
    at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
    at scala.Option.fold(Option.scala:158)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
    at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
    at com.crealytics.spark.excel.ExcelRelation.headerColumnForName$lzycompute(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.headerColumnForName(ExcelRelation.scala:32)
    at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$columnExtractor(ExcelRelation.scala:51)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:61)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:61)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:61)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)

Troubleshooting

I have increased the memory size, but I still get the same issue. I also set the Spark options related to the heap, but the issue persists.
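(For context: heap-related Spark settings like these are typically raised via spark-submit flags along the following lines. The values and the class/jar names below are illustrative placeholders, not the exact ones used here.)

spark-submit \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.ReaderExcel \
  readerExcel.jar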

Your Environment

  • Spark version: 2.3
  • Language: Scala
  • Spark-Excel version: com.crealytics:spark-excel_2.11:0.13.1

cedtnee avatar Nov 26 '20 23:11 cedtnee

Can you try something along the lines of .option("maxRowsInMemory", 20) (see the README)?
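For reference, a minimal sketch of what that could look like (the file name and the value are placeholders); per the README, setting maxRowsInMemory switches spark-excel to a streaming reader that keeps only a limited window of rows in memory instead of loading the whole workbook:

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 20) // stream the sheet, holding ~20 rows at a time
  .load("file.xlsx")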

nightscape avatar Nov 30 '20 08:11 nightscape

When I used .option("maxRowsInMemory", 20), the resulting DataFrame was empty. When I increased it to .option("maxRowsInMemory", 30000), I got this error:

An exception or error caused a run to abort: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.sun.xml.internal.stream.events.StartElementEvent.init(StartElementEvent.java:76)
    at com.sun.xml.internal.stream.events.StartElementEvent.<init>(StartElementEvent.java:64)
    at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(XMLEventAllocatorImpl.java:76)
    at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(XMLEventAllocatorImpl.java:53)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:84)
    at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:71)
    at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.access$200(StreamingSheetReader.java:32)
    at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:402)
    at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader$StreamingRowIterator.<init>(StreamingSheetReader.java:396)
    at shadeio.monitorjbl.xlsx.impl.StreamingSheetReader.iterator(StreamingSheetReader.java:382)
    at shadeio.monitorjbl.xlsx.impl.StreamingSheet.iterator(StreamingSheet.java:49)
    at com.crealytics.spark.excel.AreaDataLocator$class.readFromSheet(DataLocator.scala:91)
    at com.crealytics.spark.excel.CellRangeAddressDataLocator.readFromSheet(DataLocator.scala:134)
    at com.crealytics.spark.excel.CellRangeAddressDataLocator.readFrom(DataLocator.scala:144)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$buildScan$2.apply(ExcelRelation.scala:63)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$buildScan$2.apply(ExcelRelation.scala:62)
    at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:15)
    at com.crealytics.spark.excel.StreamingWorkbookReader.withWorkbook(WorkbookReader.scala:55)
    at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:62)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:415)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)

cedtnee avatar Nov 30 '20 14:11 cedtnee

I'm not surprised by the OutOfMemoryError, but getting an empty result is weird... Could you try some more values in between 20 and 30000?

nightscape avatar Dec 03 '20 00:12 nightscape

I used .option("maxRowsInMemory", 200), but I still often get an OOM error. My Excel file is only 16 MB. This is the error message:

diagnostics: User class threw exception: java.lang.OutOfMemoryError: Java heap space
    at org.apache.xmlbeans.impl.store.Cur.createElementXobj(Cur.java:260)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Cur.java:2997)
    at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3164)
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3422)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
    at org.openxmlformats.schemas.spreadsheetml.x2006.main.SstDocument$Factory.parse(Unknown Source)
    at shadeio.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:123)
    at shadeio.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:111)
    at shadeio.poi.xssf.eventusermodel.XSSFReader.getSharedStringsTable(XSSFReader.java:115)
    at shadeio.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:132)
    at shadeio.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at shadeio.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.crealytics.spark.excel.StreamingWorkbookReader.openWorkbook(WorkbookReader.scala:63)
    at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
    at com.crealytics.spark.excel.StreamingWorkbookReader.withWorkbook(WorkbookReader.scala:54)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:163)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:162)

AlexZhang267 avatar Mar 18 '21 08:03 AlexZhang267

@AlexZhang267 I unfortunately can't invest much time into spark-excel at the moment. If possible, I would recommend increasing Spark's memory to work around this.

nightscape avatar Mar 18 '21 10:03 nightscape

I'm getting the same problem here. I tried using "maxRowsInMemory" with values like 1000, 10000, and 100000. It works fine for the first 3 Excel files, but I have 8 to process and the driver node always dies on the 4th file. Each file is roughly 600 MB. I tried spark.driver.extraJavaOptions -XX:+UseG1GC and increased driver memory to 56 GB, but the driver node still crashes. My understanding is that the driver crashes if there are too many occurrences of full garbage collection. I also noticed the cluster doesn't scale up beyond 1 worker node even though I have autoscale set to 4 worker nodes. So clearly there is some type of collect() happening on the driver node with this library? This library doesn't seem to use Spark's parallelism very well.

kennydataml avatar Aug 05 '21 02:08 kennydataml

Hi @kennydataml, I am not sure if this is related. How do you read these 8 files? It would help if we had simplified (without sensitive data) and reproducible code. I am preparing a PR for native support of loading multiple files. Sincerely,

quanghgx avatar Aug 06 '21 01:08 quanghgx

I unfortunately can't get a scrubbed Excel file for you since it's on a client laptop. The code is simple: I only read in one Excel file at a time with a for loop. So basically:

for xlpath in excels:
    csvpath = ...  # derived from xlpath (split/join, details elided)
    try:  # exception handling since we don't know the number of sheets
        for i in range(15):  # dynamic number of sheets
            df = (spark.read
                .format("crealytics ... spark excel yada yada")
                .option("dataAddress", f"{i}!A1")  # sub sheet index here
                .option("header", "true")
                .option("maxRowsInMemory", 100000)
                .load(xlpath)
            )
            # write excel to csv
            (df.repartition(200)  # attempting to circumvent memory issues
                .write
                .format("csv")
                .mode("append")
                .option("header", "true")
                .save(csvpath)
            )
    except Exception as err:
        print(repr(err))

I've narrowed down the problem to only 1 of the 8 Excel files. I can consistently reproduce it on that particular file. It opens up just fine in Microsoft Excel, so I'm puzzled why only that one file gives me an issue. The behaviour I'm observing: I can read the first 6 sheets, but it hangs on the 7th sheet when it tries to append-write to the CSV path. No Spark job is initiated for the write as expected; it just hangs for an hour straight. I've tried running the code without looping through all the Excel files, targeting only that particular file, and it consistently hangs on the 7th sheet. It's really weird; I'm going to do some more testing tomorrow with a fresh copy of the data.

kennydataml avatar Aug 06 '21 02:08 kennydataml

I have the same issue and my Excel file is only 2 MB. It also only happens with some specific files.

NestorAGC123 avatar Aug 16 '21 22:08 NestorAGC123

Hi @NestorAGC123, is it possible to share your Excel file after removing sensitive data?

quanghgx avatar Aug 17 '21 01:08 quanghgx

The problem is that spark-excel reads the file as an input stream, and that uses far more memory than reading it as a java.io.File. I have logged https://bz.apache.org/bugzilla/show_bug.cgi?id=65581 for a possible solution.
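A minimal sketch of that difference in plain (unshaded) POI — illustrative only, not spark-excel's actual code, and the path is a placeholder:

import java.io.{File, FileInputStream}
import org.apache.poi.ss.usermodel.WorkbookFactory

// File-based open: POI can read the zip package lazily from disk.
val fromFile = WorkbookFactory.create(new File("/tmp/file.xlsx"))
fromFile.close()

// Stream-based open: POI has to buffer the whole package in memory first,
// which is the extra heap usage described above.
val fromStream = WorkbookFactory.create(new FileInputStream("/tmp/file.xlsx"))
fromStream.close()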

pjfanning avatar Sep 18 '21 13:09 pjfanning

Is this problem still unresolved? I have the same problem.

yanghong avatar May 18 '22 11:05 yanghong

@yanghong Maybe you could do a PR based on the changes in https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?

pjfanning avatar May 18 '22 11:05 pjfanning

> @yanghong Maybe you could do a PR based on the changes in https://bz.apache.org/bugzilla/show_bug.cgi?id=65581?

Fine, it's POI's limitation.

yanghong avatar May 18 '22 12:05 yanghong