Java module import error due to shaded package shaded.parquet.it.unimi.dsi.fastutil
Description:
Because the shaded package shaded.parquet.it.unimi.dsi.fastutil is present in both org.apache.parquet:parquet-avro and org.apache.parquet:parquet-column, it is not possible to use both dependencies at the same time within a modularized Java project.
How to reproduce:
- Create a Maven project with the dependency org.apache.parquet:parquet-avro:1.11.1.
- Declare a Java module that requires both parquet.avro and parquet.column.
- Run mvn compile.
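For illustration, the module declaration in the second step might look roughly like the sketch below. The module name example.app is hypothetical; parquet.avro and parquet.column are the module names that appear in the error message reported in this issue.

```java
// module-info.java
// "example.app" is a hypothetical name chosen for this sketch.
module example.app {
    // Both jars carry a copy of shaded.parquet.it.unimi.dsi.fastutil,
    // so requiring both is what triggers the split-package error.
    requires parquet.avro;
    requires parquet.column;
}
```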
Expected behaviour:
Project should compile without errors.
Actual behaviour:
Project fails with compilation errors:
[ERROR] the unnamed module reads package shaded.parquet.it.unimi.dsi.fastutil from both parquet.column and parquet.avro
...
Reproducible example (same code as in the attached zip file): https://github.com/xCASx/parquet-example
Reporter: Maxim Kolesnikov
Note: This issue was originally created as PARQUET-2035. Please see the migration documentation for further details.
Gabor Szadovszky / @gszadovszky:
[~cas], thanks for reporting this issue.
I don't have any experience with Java 11 modules. Since parquet-mr still targets Java 8 and a couple of other projects use it without any issue (probably not in a Java 11 modularized environment), I would not call this a bug. I would expect some workarounds to exist for Java 11 modularized environments, since it works without modules.
Meanwhile, I am happy to help with or review any contribution to parquet-mr that makes it work properly with Java 11 modules.
Maxim Kolesnikov: Hello @gszadovszky, Thank you for your quick reply.
Sorry, I should have made it clear that the issue is reproducible since Java 9, when modules were added to the language specification. I updated the example in my GitHub repo to represent the Java 9 case.
While it's totally understandable that there are clients who may still use Java 8 and that parquet-mr wants to keep compatibility, it is a pity that features of newer versions are not supported yet. The Java 9 spec was published in August 2017, while the sunsetting of Java 8 began in 2019. Java 11 (the next LTS release) has been available since September 2018, and the upcoming LTS should be available this September.
While I'd be glad to contribute to the solution, I lack context on the original intention for shading this dependency.
I see that the change was initially introduced in the parquet-column module with this commit: f98cd399f6c1f416835814fb17ec6a3d3080a389, with no associated ticket. With PARQUET-1529 the library was shaded in parquet-avro and parquet-hadoop as well.
What would be the implications of rolling back PARQUET-1529? Is this a valid option?
Gabor Szadovszky / @gszadovszky:
[~cas], I am not sure about the original purpose of shading. Usually, shading is implemented if a component relies on a specific version of a dependency and does not want to add conflicts in a large ecosystem like Hadoop.
Even though Java 8 is already EOL, that does not mean we cannot keep our source compatible with Java 8 and still build for it. Most of the environments where parquet-mr is used already run Java 11, but only as the runtime. AFAIK, the whole Hadoop ecosystem is still stuck on Java 8 at the source level. So this is not about supporting Java 9 or Java 11 but about supporting the modules feature of Java. As I've said, I don't have any experience with Java modules, so I am not sure about the actual problem and what the best fix for it would be.
Based on the original commit, we have been shading fastutil for a while. PARQUET-1529 was only about keeping it shaded in all of our modules where it is used. By reverting PARQUET-1529 we would end up conflicting with other fastutil versions in the Hadoop ecosystem.
We need someone who has experience with Java modules to contribute here. It also seems a tough issue, as we can hardly implement unit tests for such changes.
Maxim Kolesnikov: Yes, the original ticket and the follow-up conversation are about the modules feature; there was no claim that Java 9 is not supported.
I'm not sure what quick solution options I have, if any. For that I need to dig deeper.
The long-term solution, though, would be to get rid of the split package, as Java modules don't allow those.
I see 4 ways of doing this:
1. Use unique names for shaded packages, e.g. shaded.parquet.avro.it.unimi.dsi.fastutil and shaded.parquet.column.it.unimi.dsi.fastutil. May not be an option, as it may break application logic due to the presence of multiple instances of the same classes.
2. Get rid of the dependency on fastutil in any two of the three modules that currently shade it. That may require significant code refactoring and may not be feasible. At least in parquet-hadoop it seems to be used in a single place for some performance optimisation.
3. Achieve consistency of fastutil across Hadoop projects. After that no shading is needed. Would be ideal, but is probably an even less feasible solution.
4. Create a new module, e.g. parquet-fastutil, that would contain only the shaded library. Add this module as a transitive non-shaded dependency of the modules that depend on fastutil: parquet-avro, parquet-column, parquet-hadoop.
What do you think? Do you see any other options?
Gabor Szadovszky / @gszadovszky:
Use unique names for shaded packages, e.g. shaded.parquet.avro.it.unimi.dsi.fastutil, shaded.parquet.column.it.unimi.dsi.fastutil. Probably not an option, as it may break application logic due to the presence of multiple instances of the same classes.
I think it should work; however, it would not be a nice solution. The classes wouldn't be the same because their packages would be different.
Get rid of the dependency on fastutil in any two of the three modules that currently shade it. That may require significant code refactoring and may not be feasible. At least in parquet-hadoop it seems to be used in a single place for some performance optimisation.
As you've said, we require fastutil for performance. Dropping fastutil from any place we use it would result in a performance drawback.
Achieve consistency of fastutil across Hadoop projects. Would be ideal, but is probably an even less feasible solution.
I would agree this is not possible. There are a lot of components in the ecosystem, and much smaller efforts on similar issues have already died.
Create a new module, e.g. parquet-fastutil, that would contain only the shaded library. Add this module as a transitive non-shaded dependency of the modules that depend on fastutil: parquet-avro, parquet-column, parquet-hadoop.
I think this is our best option. We already have a separate module for jackson for the same purpose. Instead of creating a new module for fastutil, what do you think about renaming the existing parquet-jackson module to parquet-3rdparty (or some better name) so that it contains all the dependencies we would like to shade? The only issue with this approach is that we cannot minimize the jar during shading, and fastutil is quite big (19M).
Maxim Kolesnikov:
Ah, didn't notice that you already have parquet-jackson.
Taking into account that shading fastutil where it is not needed already led to some issues reported by @Fokko in the past, I would not bundle the shaded fastutil together with jackson, to avoid reproducing the size-related issue. In my opinion, adding another module for fastutil will be fine. As there are just two such libraries, that should not be a big maintenance overhead. Later on, if more libraries need to be shaded (I hope not), this approach may be reviewed accordingly.
Gabor Szadovszky / @gszadovszky:
The issue that @Fokko reported was solved by PARQUET-1853. The fix was simply adding <minimizeJar>true</minimizeJar> to the pom. The issue with central shading is that this option would not be available, so we would re-introduce the problem. This is independent of whether we use one dependency module or separate ones. Maybe option 1 would be better for fastutil. It is even easier to implement.
Could you please try out one of the solutions in your environment so we know whether this is the only issue we have with Java modules?
Maxim Kolesnikov: Sure, I'll try it.
Option 1, even if it works today, may bring issues in the future if fastutil types cross the boundaries of the libraries.
For example, parquet-avro depends on parquet-column. If at any point parquet-avro casts, or checks with instanceof, a type received from parquet-column (or vice versa), the logic will break, because the two libraries operate on the same classes relocated into different packages, which makes them distinct types.
If there is a strict policy on avoiding leakage of fastutil classes across the boundaries of the parquet libraries, then option 1 will work now and in the future.
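The concern about option 1 can be sketched in a single file. This is only a simulation under stated assumptions: the nested holders ColumnShaded and AvroShaded are hypothetical stand-ins for the two relocated packages, and IntArrayList stands in for any fastutil class; nothing here is taken from the parquet-mr code base.

```java
public class ShadedCopiesDemo {

    // Hypothetical stand-in for the copy relocated under shaded.parquet.column...
    static final class ColumnShaded {
        static final class IntArrayList { }
    }

    // Hypothetical stand-in for the copy relocated under shaded.parquet.avro...
    static final class AvroShaded {
        static final class IntArrayList { }
    }

    // "parquet-column" hands out an instance of its own shaded copy.
    public static Object fromColumn() {
        return new ColumnShaded.IntArrayList();
    }

    // "parquet-avro" checks the value against its own shaded copy.
    public static boolean avroAccepts(Object value) {
        return value instanceof AvroShaded.IntArrayList;
    }

    // Returns true when casting to the avro-side copy throws ClassCastException.
    public static boolean castFails(Object value) {
        try {
            AvroShaded.IntArrayList ignored = (AvroShaded.IntArrayList) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object list = fromColumn();
        System.out.println("instanceof avro copy: " + avroAccepts(list)); // false
        System.out.println("cast fails: " + castFails(list));             // true
    }
}
```

Identical source under two different qualified names yields two unrelated types, so the check is false and the cast throws. As long as no fastutil type appears in a signature or value that crosses a library boundary, this situation does not arise.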
Gabor Szadovszky / @gszadovszky: Got it, thanks for explaining. I guess we don't have, and don't need, constructs that would pass fastutil types across those boundaries. Meanwhile, I am not sure we can add proper checks to prevent them. However, it would be great if we could implement some tests that ensure parquet-mr works fine in a modularized Java environment. Do you have any idea how to implement such a thing?
Ismaël Mejía / @iemejia: Just for reference, we had a similar issue reported on Beam, and what we did was create a 'test' module based on Java 11 that fails in case of split packages; see BEAM-8024 and https://github.com/apache/beam/tree/master/sdks/java/testing/jpms-tests
Maxim Kolesnikov: Thanks @iemejia, that's exactly what I was doing. I have a half-baked patch now and am going to finalize it by the end of this week.
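As a rough idea of what such a check could look like without a dedicated test module, the sketch below uses the JDK's java.lang.module.ModuleFinder to list the packages of each jar (as an explicit or automatic module) and reports any package owned by more than one module. It is not the Beam test nor the patch mentioned above; the class name SplitPackageCheck and its methods are hypothetical, and it assumes Java 11 or later.

```java
import java.lang.module.ModuleFinder;
import java.lang.module.ModuleReference;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SplitPackageCheck {

    /** Returns every package contained in more than one module, mapped to its owners. */
    public static Map<String, Set<String>> splitPackages(Map<String, Set<String>> packagesByModule) {
        Map<String, Set<String>> owners = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : packagesByModule.entrySet()) {
            for (String pkg : entry.getValue()) {
                owners.computeIfAbsent(pkg, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        // Keep only the packages that are claimed by at least two modules.
        owners.values().removeIf(modules -> modules.size() < 2);
        return owners;
    }

    /** Lists the packages of each module found in the given jars or directories. */
    public static Map<String, Set<String>> packagesByModule(Path... entries) {
        Map<String, Set<String>> result = new HashMap<>();
        for (ModuleReference ref : ModuleFinder.of(entries).findAll()) {
            result.put(ref.descriptor().name(), ref.descriptor().packages());
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java SplitPackageCheck <jar-or-dir>...");
            return;
        }
        Path[] entries = new Path[args.length];
        for (int i = 0; i < args.length; i++) {
            entries[i] = Path.of(args[i]);
        }
        Map<String, Set<String>> split = splitPackages(packagesByModule(entries));
        split.forEach((pkg, modules) -> System.out.println(pkg + " is in " + modules));
        if (!split.isEmpty()) {
            System.exit(1); // non-zero exit so a build step can fail on split packages
        }
    }
}
```

Passing the parquet jars from the local Maven repository as arguments would be one way to run it in a build; splitPackages is kept separate from the jar scanning so it can be unit-tested with in-memory data.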
@asfimport it is now June 2025, and I came across this ticket as I am also trying to use this library in a modularized application. The split package situation appears to not yet be solved. Is there any intent of fixing this, or do I have to create an uberjar myself with all the parquet-java dependencies?
ping @gszadovszky, see above
@rickardoberg, feel free to contribute here. I'm happy to review your potential PR. Unfortunately, I don't have the time to actually work on this.
BTW, what I do not really understand is that parquet-column is a dependency of parquet-avro, so anyone who uses parquet-avro should have both jars on the classpath by default. How come this has not been an issue for anyone in 4+ years?
BTW, I do know that Apache Iceberg depends on both jars and is built on Java 11 (at least). I never had a report from there about this issue.
@gszadovszky thanks for the followup. The issue is that the jar artifacts parquet-common and parquet-hadoop both contain the same shaded classes for fastutil, and so it becomes a split package problem when trying to run in a modular application. You can't even compile. If you run Java 11 with no modules, sure, then it works, but all of our other artifacts are modularized using JPMS.
Right now I am forced to create a shaded artifact to get around this (https://github.com/Cantara/parquet-java-shaded) which is not great.
I would prefer the solution where fastutil is handled like the shaded Jackson, i.e. a parquet-fastutil module which you can depend on from both parquet-common and parquet-hadoop.
@rickardoberg, thanks for the clarification. I agree with the proposed solution.