
Driver is up but is not responsive, likely due to GC!!! when writing dataframe to a big excel file into the blob storage

Open sabrishami opened this issue 3 years ago • 2 comments


I am trying to write a Spark DataFrame to an Excel file on blob storage.

df.repartition(1).write.format("com.crealytics.spark.excel").mode("overwrite").option("header", "true").option("maxRowsInMemory", 1000).save("/mnt/IngestExelFiles/output_fulldf.xlsx")

When the DataFrame has more than 200,000 rows, Databricks reports "Driver is up but is not responsive, likely due to GC".

Environment: Databricks Runtime 8.4 (includes Apache Spark 3.1.2, Scala 2.12). Driver type: 56 GB memory, 8 cores.

I can read the large Excel file from blob storage, but writing the same table does not work.

Is there any clue?

Thanks

sabrishami avatar Jan 14 '22 15:01 sabrishami

Hi @sabrishami, could you help us prepare the df (generating one is fine) so we can reproduce the issue on our side? I don't have ready access to Databricks, but I can run it on a local machine and observe the resource usage.

quanghgx avatar Jan 15 '22 12:01 quanghgx
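
For anyone who wants to try reproducing this locally, a generated DataFrame of roughly the reported size might be enough. The following is only a sketch in PySpark: the helper name `make_repro_df`, the three-column schema, and the default of 250,000 rows are all assumptions, since the shape of the original table was never posted. It expects an existing `SparkSession` to be passed in.

```python
def make_repro_df(spark, num_rows=250_000):
    """Build a synthetic DataFrame with num_rows rows.

    The schema (id, label, value) is hypothetical; the original table's
    columns are not known from the issue.
    """
    # spark.range gives a single `id` column; selectExpr derives two more
    # columns from SQL expressions, so no extra imports are required.
    return spark.range(num_rows).selectExpr(
        "id",
        "concat('row_', cast(id as string)) as label",
        "rand() as value",
    )
```

Calling `df = make_repro_df(spark)` and then running the write from the original post against `df` should show whether row count alone triggers the problem, although a wider table may behave differently.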

Should be fixed by 0.18.0: the Excel v2 data source now supports the maxRowsInMemory setting, which lowers the memory overhead.

pjfanning avatar Sep 17 '22 12:09 pjfanning

@sabrishami can you check if 0.18.0 with .format("excel").option("maxRowsInMemory") fixes the issue? I'll close the issue for now; if it does not work, please post a comment and I'll reopen.

nightscape avatar Oct 27 '22 07:10 nightscape
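
Putting the two suggestions above together, the write from the original post would change only in its format name. This is a sketch, assuming PySpark and spark-excel 0.18.0 or later on the cluster; the wrapper `write_excel_v2` is a hypothetical helper, and the option values are the ones from the original post rather than tuned recommendations.

```python
def write_excel_v2(df, path, max_rows_in_memory=1000):
    """Write df to a single Excel file through the v2 data source.

    Hypothetical helper: it only swaps the original
    "com.crealytics.spark.excel" format for "excel" (the v2 source).
    """
    (
        df.repartition(1)          # one partition, as in the original post
        .write.format("excel")     # v2 data source name
        .mode("overwrite")
        .option("header", "true")
        .option("maxRowsInMemory", max_rows_in_memory)
        .save(path)
    )
```

For example, `write_excel_v2(df, "/mnt/IngestExelFiles/output_fulldf.xlsx")` mirrors the original call. Whether this resolves the driver GC problem was not confirmed in this thread.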