
PR Proposal: dataframe.write.format("excel") supports SaveMode.Overwrite/Append and partitioning

Open christianknoepfle opened this issue 3 years ago • 8 comments

Expected Behavior

dataframe.write.format("excel") supports SaveMode.Overwrite/Append and partitioning in the same way as other data sources such as CSV.
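To illustrate, this is the kind of write that should work once the issue is fixed (the DataFrame, column names, and output path here are made up for the example):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical usage, assuming the fix is in place:
def writeReport(df: DataFrame): Unit =
  df.write
    .format("excel")
    .mode(SaveMode.Overwrite)     // or SaveMode.Append
    .partitionBy("year", "month") // one directory per partition value, as with csv
    .save("/data/reports")
```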

Current Behavior

SaveMode and partitioning don't work as expected; see issue #539 and issue #547

Possible Solution

Despite the comment in ExcelDataSource, I suggest changing the definition of ExcelDataSource from

class ExcelDataSource extends TableProvider with DataSourceRegister

to

class ExcelDataSource extends FileDataSourceV2

This allows us to remove most of the code copied from Spark. The result is very similar to Spark's own FileDataSourceV2 implementation for CSV.

In addition, we provide a simple implementation of the fallbackFileFormat that only supports writing. So we have

class ExcelFileFormat extends FileFormat with DataSourceRegister {
...
  override def prepareWrite ...
...
}

The prepareWrite() implementation is basically taken from ExcelWriteBuilder().
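Put together, the proposed shape could look roughly like this (a sketch only: the signatures follow Spark 3's FileDataSourceV2/FileFormat APIs, and all method bodies are placeholders, not the actual implementation):

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class ExcelDataSource extends FileDataSourceV2 {
  // Delegating to a FileFormat gives us Spark's SaveMode and
  // partitioning handling for free, as with the built-in csv source.
  override def fallbackFileFormat: Class[_ <: FileFormat] = classOf[ExcelFileFormat]
  override def shortName(): String = "excel"
  override def getTable(options: CaseInsensitiveStringMap): Table = ???
}

// Write-only FileFormat; prepareWrite would reuse the ExcelWriteBuilder logic.
class ExcelFileFormat extends FileFormat with DataSourceRegister {
  override def shortName(): String = "excel"

  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = ???

  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = ???
}
```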

So as of now the main changes are:

  • src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelDataSource.scala
  • src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelFileFormat.scala

You can see all changes here: https://github.com/christianknoepfle/spark-excel/pull/1/files

There are some very basic unit tests for SaveMode and partitioning, and they work as expected.

I have only tested it on Spark 3.0.3, so I guess some more work needs to be done, plus better test coverage and cleanup. The existing unit tests are mostly OK; I only see some failures in probably unrelated suites: PlainNumberReadSuite and ErrorAsStringsReadSuite (I am testing on Windows, not sure if that explains it).

Please let me know if this makes sense to you and how we can move forward.

Thanks

christianknoepfle avatar Feb 28 '22 19:02 christianknoepfle

Short update: I ran the changes through the ci.yml pipeline and the new tests pass for >=3.0.3 (all existing tests pass as well).

Surprisingly, the Spark API changed between 3.0.0 and 3.0.3, so the changes do not compile on 3.0.0 (in fact, it changed between 3.0.0 and 3.0.1). I would need to restructure the src folders to support it, which does not make much sense to me. Is it a prerequisite for the PR to still support 3.0.0 (while 3.0.1/3.0.3 work fine)?

For the 2.x line the new tests failed because I haven't ported the fixes over to it. Is that something the PR would need to include, or can we scope some tests to be 3.x exclusive? I just want to know how much effort is needed.

ci results are here: https://github.com/christianknoepfle/spark-excel-clone/runs/5427239976?check_suite_focus=true

christianknoepfle avatar Mar 04 '22 20:03 christianknoepfle

@christianknoepfle thanks for the effort! Afair @quanghgx had mentioned the same issues regarding changes from Spark 3.0.0 to 3.0.x. From my side it's fine to exclude 3.0.0 and target only 3.0.1 upwards. I'm also fine with adding 2.x tests later, as long as there are no regressions in the library itself.

nightscape avatar Mar 06 '22 08:03 nightscape

Hi, I did some cleanup on the code, checked the CI, and all checks passed. I removed 3.0.0 from the testing CI and replaced it with 3.0.1. The tests for writing a partitioned file structure are only executed on Spark >=3.0.1; all other new tests also pass on 2.4.x (so the SaveMode.Overwrite issue was probably broken on Spark 3 only).

I opened a PR for the changes and hope that this is OK. If something needs changing or further test coverage, please let me know. As soon as I can pull spark-excel with this change from the Maven repo, I will try it out in my "weekday professional environment" ;)

christianknoepfle avatar Mar 06 '22 14:03 christianknoepfle

Hi, great to see the PR merged into main. In order to use the feature in my business environment, I would need it as a Maven artifact, but I cannot see 0.17 there, or an updated pre-release version. Has it not been created, or am I looking in the wrong place?

christianknoepfle avatar Apr 19 '22 07:04 christianknoepfle

@christianknoepfle which Spark and Scala version are you using? I see it listed here for most versions.

nightscape avatar Apr 25 '22 21:04 nightscape

Hi @nightscape, at the time I wrote the comment the versions were not shown; now I see them. It works fine on our integration server and locally :) Unfortunately our AWS EMR is still on Spark 3.0.0, and that is the version that doesn't work with spark-excel at all (too many code changes would have been needed, and I didn't want to do that). So I need to talk to the team about upgrading our EMR to some newer version.

Since EMR is available for 3.0.1 / 3.1.1 / 3.1.2 (as of now) - wouldn't it make sense to add 3.0.1 and 3.1.1 to the versions built so people are not forced to upgrade?

christianknoepfle avatar Apr 27 '22 14:04 christianknoepfle

Hi @christianknoepfle, could you check if 3.0.3_0.17.0 is compatible with Spark 3.0.1? Usually (except for the ominous 3.0.0 release) plugins should be compatible with different patch versions of Spark. It might make sense to truncate the Spark part of the version number to major.minor to reflect that fact.

nightscape avatar Apr 27 '22 14:04 nightscape

@nightscape, 3.0.1 and 3.0.3 worked fine on our local servers. I only did a little testing, but so far it looks OK. Since I am away next week, it will take a bit until I can give you feedback (and confirm whether it works on 3.0.1 EMR). From what I understand from the versioning policy (https://spark.apache.org/versioning-policy.html), Spark does not guarantee binary compatibility; they try, but do not ensure it. I like your idea, but I am not sure it will work out in the end...

christianknoepfle avatar Apr 27 '22 15:04 christianknoepfle