[BUG] Spark Excel does not work on AWS EMR (even with thin assembly)
Am I using the newest version of the library?
- [x] I have made sure that I'm using the latest version of the library.
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
Until now I have been using my own compiled spark-excel jar, because the default one did not work on AWS EMR (latest version 7.9.0).
After the "thin" assembly was introduced I ran a first test, and reading Excel worked very well. But occasionally we also write Excel files, and that fails with a NoSuchMethod error.
Expected Behavior
The main reason for the failure is the presence of pre-installed Hadoop 3.4.1 jars on AWS EMR. Hadoop 3.4.1 (the latest version) uses commons-compress 1.22. POI also needs commons-compress and uses an API introduced with 1.25 (it currently binds 1.27). When running spark-submit, the class loader picks up the Hadoop-provided commons-compress first, so the needed API function is missing and the job crashes.
I assume this issue holds true for any other "Spark Cluster service provided by your favourite cloud provider".
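To confirm on a given cluster which copy of commons-compress wins, a small diagnostic like the following can help. It is only a sketch: the class name `ClassOrigin` and its helpers are made up for illustration, and the default class to inspect (`ZipArchiveOutputStream`) is an assumption about what POI's write path touches, not something taken from the stack trace. It uses only the JDK, so it can be run with the same classpath as the Spark job.

```java
import java.security.CodeSource;
import java.util.Arrays;

/** Diagnostic sketch (hypothetical helper): where was a class loaded from, and does it have a method? */
public class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a marker for JDK/bootstrap classes. */
    public static String locate(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    /** True if the class exposes a public method with this name (any signature). */
    public static boolean hasMethod(Class<?> cls, String methodName) {
        return Arrays.stream(cls.getMethods())
                .anyMatch(m -> m.getName().equals(methodName));
    }

    public static void main(String[] args) {
        // Assumed default: a commons-compress class that is likely involved when writing xlsx files.
        String className = args.length > 0
                ? args[0]
                : "org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream";
        try {
            Class<?> cls = Class.forName(className);
            System.out.println(className + " loaded from " + locate(cls));
            if (args.length > 1) {
                // Optional second argument: the method name reported in the NoSuchMethod error.
                System.out.println("has method " + args[1] + ": " + hasMethod(cls, args[1]));
            }
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

If the printed location points at a Hadoop directory of the EMR installation rather than at the spark-excel jar, the older commons-compress is the one being picked up. Passing the method name from the error message as the second argument shows whether that copy actually has it.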
Now we could try to force a different class load ordering (I haven't investigated that) or patch the EMR installation (that will be tricky), but I guess "shading" would be the best option here. Or does someone have a better idea how to cope with it?
If shading is the way to go, I would suggest also offering a classifier for "emr" (or a more generic name, because this issue will come up for other cluster environments too). @nightscape what are your thoughts on this?
BR
Christian
Steps To Reproduce
Here are some details on the issue:
The error message:
The change in commons-compress (materialized in 1.25.0):
Environment
- Spark version: 3.5.5
- Spark-Excel version: 3.5.6_0.31.2
- OS: Amazon Linux 2023
- Cluster environment: AWS EMR 7.9
Anything else?
No response
Hi @christianknoepfle,
commons-compress is haunting us 😅 I sketched a solution here: https://github.com/nightscape/spark-excel/pull/741#issuecomment-3117737583 Would you mind giving this a try?
Oops, I should have searched all issues and not just checked the open ones. Yes, I can try. So basically the idea is that thin gets commons-compress shaded, and potentially other libs too? Or should that be a different classifier?
I just created a release which bundles commons-compress and POI into thin JARs as well: https://github.com/nightscape/spark-excel/releases/tag/v0.32.1-alpha1 Once it's finished building and publishing, would you mind giving it a try?
Yes, will do, but not before Tuesday because I am traveling.
I tried, but got a ton of shading issues when building the assembly. I need some time to investigate and will then come back to you.