excel-streaming-reader
excel-streaming-reader copied to clipboard
This dataset causes out of memory error
This dataset runs for several hours and eventually runs out of heap space
Possibly the issue is the large number of unique strings?
the dataset contains many unique string values
it looks like "SharedStringsTable" is getting too large. Is there any way to avoid this?
java.lang.OutOfMemoryError: Java heap space at org.apache.xmlbeans.impl.store.Saver$TextSaver.resize(Saver.java:1700) at org.apache.xmlbeans.impl.store.Saver$TextSaver.preEmit(Saver.java:1303) at org.apache.xmlbeans.impl.store.Saver$TextSaver.emit(Saver.java:1190) at org.apache.xmlbeans.impl.store.Saver$TextSaver.emitElement(Saver.java:962) at org.apache.xmlbeans.impl.store.Saver.processElement(Saver.java:476) at org.apache.xmlbeans.impl.store.Saver.process(Saver.java:307) at org.apache.xmlbeans.impl.store.Saver$TextSaver.saveToString(Saver.java:1864) at org.apache.xmlbeans.impl.store.Cursor._xmlText(Cursor.java:546) at org.apache.xmlbeans.impl.store.Cursor.xmlText(Cursor.java:2436) at org.apache.xmlbeans.impl.values.XmlObjectBase.xmlText(XmlObjectBase.java:1500) at org.apache.poi.xssf.model.SharedStringsTable.getKey(SharedStringsTable.java:135) at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:125) at org.apache.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:107) at org.apache.poi.xssf.eventusermodel.XSSFReader.getSharedStringsTable(XSSFReader.java:114) at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:120) at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91) at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:263)
Did you try setting the StreamingReader.builder().sstCacheSize(numberOfItemsToKeepInMemory) From what I saw your file has count="44612922" uniqueCount="3216056" items
Try setting the number as high as your application can handle as this will affect the performance of the file parsing
I think that the sstCacheSize option was removed
What version of the library are you using? The latest version should have that.
I am using the latest. If I add the option I get the runtime error below. Is this an unsupported excel version?
java.lang.NoClassDefFoundError: org/openxmlformats/schemas/spreadsheetml/x2006/main/CTPhoneticRun at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.getDeclaredMethods(Class.java:1975) at shadeio.databind.introspect.AnnotatedClass._findClassMethods(AnnotatedClass.java:1213) at shadeio.databind.introspect.AnnotatedClass._addMemberMethods(AnnotatedClass.java:741) at shadeio.databind.introspect.AnnotatedClass._resolveMemberMethods(AnnotatedClass.java:549) at shadeio.databind.introspect.AnnotatedClass.resolveMemberMethods(AnnotatedClass.java:541) at shadeio.databind.introspect.AnnotatedClass.memberMethods(AnnotatedClass.java:340) at shadeio.databind.introspect.POJOPropertiesCollector._addMethods(POJOPropertiesCollector.java:512) at shadeio.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:303) at shadeio.databind.introspect.POJOPropertiesCollector.getJsonValueMethod(POJOPropertiesCollector.java:173) at shadeio.databind.introspect.BasicBeanDescription.findJsonValueMethod(BasicBeanDescription.java:224) at shadeio.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:350) at shadeio.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:219) at shadeio.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:168) at shadeio.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1308) at shadeio.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1258) at shadeio.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:500) at shadeio.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:698) at shadeio.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:270) at shadeio.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3697) at shadeio.databind.ObjectMapper.writeValueAsBytes(ObjectMapper.java:3097) at com.monitorjbl.xlsx.sst.FileBackedList.writeToFile(FileBackedList.java:83) at com.monitorjbl.xlsx.sst.FileBackedList.add(FileBackedList.java:61) at com.monitorjbl.xlsx.sst.BufferedStringsTable.readFrom(BufferedStringsTable.java:42) at com.monitorjbl.xlsx.sst.BufferedStringsTable.<init>(BufferedStringsTable.java:30) at com.monitorjbl.xlsx.sst.BufferedStringsTable.getSharedStringsTable(BufferedStringsTable.java:25) at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:118) at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91) at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:263) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:60) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:54) at scala.Option.map(Option.scala:146) at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:54) at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:66) at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:281) at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:280) at scala.Option.getOrElse(Option.scala:121) at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:280) at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:100) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:38) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135) at com.mineset.spark.dataaccess.DataAccess.readExcel(DataAccess.scala:507) at com.mineset.spark.datavault.PreviewDataset.previewXls(PreviewDataset.scala:250) at com.mineset.spark.datavault.PreviewDataset.execute(PreviewDataset.scala:171) at com.mineset.spark.datavault.PreviewDataset.execute(PreviewDataset.scala:132) at com.mineset.spark.common.util.CommandProcessor$$anonfun$process$2.apply(CommandProcessor.scala:60) at com.mineset.spark.common.util.CommandProcessor$$anonfun$process$2.apply(CommandProcessor.scala:58) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at com.mineset.spark.common.util.CommandProcessor.process(CommandProcessor.scala:58) at com.mineset.spark.datavault.DataVault.processCommands(DataVault.scala:69) at com.mineset.spark.datavault.DataVaultJob$.runJob(DataVaultJob.scala:36) at com.mineset.spark.datavault.DataVaultJob$.runJob(DataVaultJob.scala:20) at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:292) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.openxmlformats.schemas.spreadsheetml.x2006.main.CTPhoneticRun at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 69 more Exception in thread "pool-1-thread-7" java.lang.NoClassDefFoundError: org/openxmlformats/schemas/spreadsheetml/x2006/main/CTPhoneticRun at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.getDeclaredMethods(Class.java:1975) at shadeio.databind.introspect.AnnotatedClass._findClassMethods(AnnotatedClass.java:1213) at shadeio.databind.introspect.AnnotatedClass._addMemberMethods(AnnotatedClass.java:741) at shadeio.databind.introspect.AnnotatedClass._resolveMemberMethods(AnnotatedClass.java:549) at shadeio.databind.introspect.AnnotatedClass.resolveMemberMethods(AnnotatedClass.java:541) at shadeio.databind.introspect.AnnotatedClass.memberMethods(AnnotatedClass.java:340) at shadeio.databind.introspect.POJOPropertiesCollector._addMethods(POJOPropertiesCollector.java:512) at shadeio.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:303) at shadeio.databind.introspect.POJOPropertiesCollector.getJsonValueMethod(POJOPropertiesCollector.java:173) at shadeio.databind.introspect.BasicBeanDescription.findJsonValueMethod(BasicBeanDescription.java:224) at shadeio.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:350) at shadeio.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:219) at shadeio.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:168) at shadeio.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1308) at shadeio.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1258) at shadeio.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:500) at shadeio.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:698) at shadeio.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:270) at shadeio.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3697) at shadeio.databind.ObjectMapper.writeValueAsBytes(ObjectMapper.java:3097) at com.monitorjbl.xlsx.sst.FileBackedList.writeToFile(FileBackedList.java:83) at com.monitorjbl.xlsx.sst.FileBackedList.add(FileBackedList.java:61) at com.monitorjbl.xlsx.sst.BufferedStringsTable.readFrom(BufferedStringsTable.java:42) at com.monitorjbl.xlsx.sst.BufferedStringsTable.<init>(BufferedStringsTable.java:30) at com.monitorjbl.xlsx.sst.BufferedStringsTable.getSharedStringsTable(BufferedStringsTable.java:25) at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:118) at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91) at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:263) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:60) at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:54) at scala.Option.map(Option.scala:146) at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:54) at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:66) at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:281) at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:280) at scala.Option.getOrElse(Option.scala:121) at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:280) at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:100) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:38) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135) at com.mineset.spark.dataaccess.DataAccess.readExcel(DataAccess.scala:507) at com.mineset.spark.datavault.PreviewDataset.previewXls(PreviewDataset.scala:250) at com.mineset.spark.datavault.PreviewDataset.execute(PreviewDataset.scala:171) at com.mineset.spark.datavault.PreviewDataset.execute(PreviewDataset.scala:132) at com.mineset.spark.common.util.CommandProcessor$$anonfun$process$2.apply(CommandProcessor.scala:60) at com.mineset.spark.common.util.CommandProcessor$$anonfun$process$2.apply(CommandProcessor.scala:58) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at com.mineset.spark.common.util.CommandProcessor.process(CommandProcessor.scala:58) at com.mineset.spark.datavault.DataVault.processCommands(DataVault.scala:69) at com.mineset.spark.datavault.DataVaultJob$.runJob(DataVaultJob.scala:36) at com.mineset.spark.datavault.DataVaultJob$.runJob(DataVaultJob.scala:20) at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:292) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.openxmlformats.schemas.spreadsheetml.x2006.main.CTPhoneticRun at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 69 more
ok I see I need ooxml-schemas.. I will give it a try
for some reason, when I use sstCacheSize with this dataset, I now have to use the full ooxml-schemas instead of the reduced set of poi-ooxml-schemas
I was able to get it running but it still fails to load the dataset.. with sstCacheSize(100)