spark-excel
Writing a large Dataset into an Excel file causes `java.lang.OutOfMemoryError: GC overhead limit exceeded`
Hi All,
I am trying to write a relatively large Spark Dataset (> 50000 rows) into an Excel file using spark-excel as follows:
DataFrameWriter<Row> writer = dataset.write()
        .format("com.crealytics.spark.excel")
        .option("sheetName", sheets.get(0)) // Only first sheet
        .option("useHeader", useHeader)
        .mode("overwrite");
writer.save(outputPath); // outputPath: path to the target .xlsx file; the error is thrown during this write
This works fine if the Dataset contains around 20,000 rows, but from approximately 35,000 rows the code throws `java.lang.OutOfMemoryError: GC overhead limit exceeded`. The easiest temporary solution would be bumping up the memory allocated to the JVM, but I would like to avoid that, since larger Datasets may come in and cause the same issue. Any thoughts on this? Thanks.
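For completeness, the stopgap mentioned above (raising JVM memory) is usually applied when submitting the job rather than in application code. A minimal sketch, assuming the session is built programmatically; the app name and the `4g` value are illustrative, not from this thread:

```java
import org.apache.spark.sql.SparkSession;

public class MemoryConfigExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("excel-export") // illustrative name
                // Executor heap can be set from code before the session starts.
                .config("spark.executor.memory", "4g")
                // NOTE: spark.driver.memory does NOT take effect here in client or
                // local mode, because the driver JVM is already running; pass it via
                // `spark-submit --driver-memory 4g` or spark-defaults.conf instead.
                .getOrCreate();
        // ... build the Dataset and write it as before ...
        spark.stop();
    }
}
```

As the reporter notes, this only postpones the problem for larger inputs; a streaming writer is the actual fix.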
Currently we're using Spoiwo to write Excel files. There is an open issue to use SXSSF but that would require major changes. So the options would be to create a PR in Spoiwo or replace Spoiwo itself with something else that is capable of streaming. I don't have time to do this myself, but I'd try to be helpful if you want to give it a try.
Thanks for the comment @nightscape.
We decided that implementing the simplest version of Excel writer with SXSSF ourselves is the best option due to the time constraint. So I am going to close this issue for now.
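For anyone landing here later: a minimal sketch of what "the simplest version of an Excel writer with SXSSF" can look like, assuming Apache POI is on the classpath. The class name, sheet name, output path, and row count are illustrative; in a Spark job the loop would typically consume `dataset.toLocalIterator()` on the driver instead of generating rows.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingExcelWriter {
    public static void main(String[] args) throws IOException {
        // Keep only 100 rows in memory; older rows are flushed to a temp file.
        try (SXSSFWorkbook workbook = new SXSSFWorkbook(100)) {
            Sheet sheet = workbook.createSheet("Sheet1");
            for (int r = 0; r < 50_000; r++) {
                Row row = sheet.createRow(r);
                Cell cell = row.createCell(0);
                cell.setCellValue("row " + r);
            }
            try (FileOutputStream out = new FileOutputStream("large.xlsx")) {
                workbook.write(out);
            }
            workbook.dispose(); // delete the temp files backing the flushed rows
        }
    }
}
```

The sliding window is what keeps heap usage flat regardless of row count: only the last 100 rows live in memory, while everything else sits in a temporary file until the final `write`. Note that an .xlsx sheet tops out at 1,048,576 rows either way.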
Hello, I am facing the same issue. How can I do that? I mean, how can I use SXSSF? We are trying to write 500,000 rows into an .xlsx file and it is impossible. Thanks in advance.
It seems that Spoiwo integrated the SXSSF writing. Unfortunately, I don't have time to look into how to actually use it, but if someone wants to create a PR for it, I'd be happy to review.
Hi @AntonioNieto92 and @kansuke231, if possible, could you please share your Excel file (after removing sensitive data)? Or the steps to generate one (column set, number of rows) that reproduces this issue? Sincerely,