
[BUG] Spark Excel does not work on AWS EMR (even with thin assembly)

Open christianknoepfle opened this issue 5 months ago • 5 comments

Am I using the newest version of the library?

  • [x] I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

Until recently I used my own compiled spark-excel JAR, because the default one did not work on AWS EMR (latest version 7.9.0).

After the "thin" assembly was introduced I ran a first test and it worked very well for reading Excel files. But occasionally we also write Excel files, and there it fails with NoSuchMethod.

Expected Behavior

The main reason for the failure is the presence of pre-installed Hadoop 3.4.1 JARs on AWS EMR. Hadoop 3.4.1 (currently the latest version) uses commons-compress 1.22. POI also needs commons-compress and uses an API introduced in 1.25 (it currently binds 1.27). When running spark-submit, the class loader picks up the Hadoop-provided commons-compress first, so the required API method is missing and the job crashes.
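
One way to confirm which commons-compress JAR actually wins on the cluster is to ask the JVM where a class was loaded from. The following is a minimal sketch, not part of spark-excel: the class name `WhichJar`, the helper `locate`, and the choice of `ZipArchiveOutputStream` as the default class to probe are assumptions made for illustration (the class named in the actual stack trace may differ).

```java
import java.security.CodeSource;

public class WhichJar {
    /** Returns the location a class was loaded from, or a short note if unknown. */
    static String locate(String className) {
        try {
            // Load without initializing the class; we only need its origin.
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Classes from the JDK's bootstrap loader have no code source.
            return src == null ? "(bootstrap/JDK)" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Hypothetical default: a commons-compress class chosen only as an example.
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Run with the same classpath as the failing job, the printed location should show whether the class comes from the Hadoop-provided JAR or from the one bundled with the application.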

I assume this issue holds true for any other "Spark Cluster service provided by your favourite cloud provider".

We could try to force a different class loading order (I haven't investigated that) or patch the EMR installation (that will be tricky), but I guess shading would be the best option here. Or does someone have a better idea for how to cope with it?

If shading is the way to go, I would suggest also offering a classifier for "emr" (or a more generic name, because this issue will come up in other cluster environments too). @nightscape what are your thoughts on this?

BR

Christian

Steps To Reproduce

Here are some details on the issue:

The error message: [image not preserved]

The change in commons-compress (materialized in 1.25.0): [image not preserved]

Environment

- Spark version: 3.5.5
- Spark-Excel version: 3.5.6_0.31.2
- OS: Amazon Linux 2023
- Cluster environment: AWS EMR 7.9

Anything else?

No response

christianknoepfle • Jul 31 '25 07:07

Hi @christianknoepfle,

commons-compress is haunting us 😅 I sketched a solution here: https://github.com/nightscape/spark-excel/pull/741#issuecomment-3117737583 Would you mind giving this a try?

nightscape • Jul 31 '25 09:07

Oops, I should have searched all issues and not just checked the open ones. Yes, I can try. So basically the idea is that thin gets commons-compress shaded, and potentially other libs too? Or should that be a different classifier?

christianknoepfle • Jul 31 '25 10:07

I just created a release which bundles commons-compress and POI into thin JARs as well: https://github.com/nightscape/spark-excel/releases/tag/v0.32.1-alpha1 Once it's finished building and publishing, would you mind giving it a try?

nightscape • Aug 01 '25 12:08

Yes, will do, but on Tuesday at the earliest because I am traveling.

christianknoepfle • Aug 01 '25 12:08

I tried but got a ton of shading issues when building the assembly. I need some time to investigate and will then come back to you.

christianknoepfle • Aug 09 '25 11:08