spark-excel [FEATURE] Optimize JAR size

Is there an existing issue for this?

[x] I have searched the existing issues

Current Behavior

Hello, I was checking your library and noticed that the final JAR is quite big (30MB). The library is very useful, but it is heavy to include in a fat JAR, so maybe it can be optimized a bit. Checking on Maven, I noticed that:

you are including Scala as a compile dependency, but I think it could be marked as provided
you are including both poi-ooxml and poi-ooxml-lite (judging by FAQ#3, I think only one of these should be included)

I don't know if these tips help.

Apr 23 '24 10:04 alessandrorimoldi

Hi @alessandrorimoldi, do you have any restrictions that make the 30MB too big? There might be a few tweaks to reduce the file size, but I've already managed to mess up the Uber-JAR packaging too many times to be motivated to try again 😉

Apr 24 '24 10:04 nightscape

Hi @nightscape, it's not a hard restriction, but I wanted to use your library in one of my libraries that is shared among all my projects. Since I build fat JARs for Spark and then upload them to the cluster, the extra 30MB is a bit annoying for my use case. I don't know if you have already tried the two things I listed above, but if they work you should be able to reduce the JAR size by 10MB, which is not a bad starting point.

Apr 24 '24 10:04 alessandrorimoldi

Hi @nightscape - we use spark-excel at my work and we've just noticed that the dependency jar brings a lot of other libraries into the Uber-JAR under their original package names. This causes havoc with some tests and is a bit of a risk if you're bringing it into software projects rather than just using it as a portable jar that you add to spark-submit or spark-shell. Would you consider shading all of the classes, or also publishing a thin jar for people who don't want all of those extra library classes in there? Happy to do my best to help with a PR if you're open to it.
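
For illustration only, here is roughly what relocating bundled classes could look like in a Mill build. This is a sketch under assumptions, not the project's actual build: it assumes Mill's `ScalaModule` and the `Assembly.Rule.Relocate` rule, and the module name, Scala version, and `shaded` prefix are hypothetical. Whether POI still works after relocation (service files, resources) would need testing.

```scala
// build.mill fragment: illustrative sketch, not the real spark-excel build.
// Assumes Mill's ScalaModule and Assembly.Rule.Relocate; the module name,
// Scala version and the "shaded" target prefix are hypothetical.
import mill._, scalalib._
import mill.scalalib.Assembly.Rule

object `spark-excel-sketch` extends ScalaModule {
  override def scalaVersion = "2.12.18"

  // Keep Mill's default merge rules and additionally move the POI classes
  // under a private package prefix inside the assembly JAR, so they no
  // longer clash with a POI version on the consumer's classpath.
  override def assemblyRules = Assembly.defaultRules ++ Seq(
    Rule.Relocate("org.apache.poi.**", "shaded.org.apache.poi.@1")
  )
}
```

Each bundled library would need its own relocation rule, which is why a thin jar is the simpler of the two options raised here.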

Apr 16 '25 17:04 q-willboulter

I've tested this, and it should be as simple as adding classifier = Some("assembly") to the extraPublish definition: https://github.com/nightscape/spark-excel/blob/main/build.mill#L74

That would mean that two jars are produced: the current uber-jar, which gets an -assembly classifier in its name, and the thin one, which only contains the dev/mauch classes.

I've used "assembly" in the above example as I think that's the standard, but @q-willboulter might know better!
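
As a rough sketch of the change described above (not a copy of the definition at the linked line): assuming the module extends Mill's `PublishModule` and that `extraPublish` returns `PublishInfo` entries, the only part taken from this thread is `classifier = Some("assembly")`. The `ivyConfig` value and the use of `assembly()` as the published file are assumptions.

```scala
// build.mill fragment: hedged sketch of the proposed change, placed inside a
// module that extends PublishModule. Only classifier = Some("assembly") comes
// from the thread; the other fields are assumptions about the build.
override def extraPublish: T[Seq[PublishInfo]] = Task {
  Seq(
    PublishInfo(
      file = assembly(),             // the uber-jar built by the assembly task
      classifier = Some("assembly"), // published as <artifact>-<version>-assembly.jar
      ivyConfig = "compile"
    )
  )
}
```

With the classifier set, the uber-jar no longer takes the place of the main artifact, so the module's regular thin jar would be published under the plain name.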

Apr 22 '25 18:04 q-benstewart

I've opened a PR with the proposed change to resolve this. As mentioned in that thread, I am happy to add/update docs as needed!

Apr 22 '25 19:04 bastewart