Out of the box support for LocalOutputFile with ParquetWriter?
Hi,
It seems that even though it is possible to use ParquetReader with LocalInputFile out of the box, it is not possible to use ParquetWriter with LocalOutputFile.
This forces people to implement their own Builder (as it is abstract), which in turn requires adding Hadoop dependencies to the classpath.
It would be great if it was possible to do so, cheers!
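For reference, here is roughly the boilerplate I mean — a minimal sketch only: `MyRecord` and `MyWriteSupport` are hypothetical placeholders for whatever record type and `WriteSupport` the application provides, and note that the `getWriteSupport` hook still takes a Hadoop `Configuration`, which is exactly the dependency I'd like to avoid:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.LocalOutputFile;
import org.apache.parquet.io.OutputFile;

// Hypothetical builder: MyRecord and MyWriteSupport are placeholders for the
// application's record type and WriteSupport implementation.
class MyParquetWriterBuilder extends ParquetWriter.Builder<MyRecord, MyParquetWriterBuilder> {

  MyParquetWriterBuilder(OutputFile file) {
    super(file);
  }

  @Override
  protected MyParquetWriterBuilder self() {
    return this;
  }

  @Override
  protected WriteSupport<MyRecord> getWriteSupport(Configuration conf) {
    // Still forced to accept a Hadoop Configuration here.
    return new MyWriteSupport();
  }
}

// Usage: writing to a plain java.nio.file.Path, no HDFS Path involved.
ParquetWriter<MyRecord> writer =
    new MyParquetWriterBuilder(new LocalOutputFile(java.nio.file.Paths.get("data.parquet")))
        .build();
```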
Note that I found https://github.com/apache/parquet-java/issues/2473 and https://github.com/apache/parquet-java/pull/1111: both contain comments suggesting this does not work as expected, maybe even for reading…
I also found https://github.com/apache/parquet-java/issues/1497, which seems similar to my issue.
IIUC, there are still some gaps before the Hadoop dependency can be removed entirely. At the very least, I have to depend on hadoop-client-api to make the build happy.
cc @amousavigourabi @Fokko for advice.
@wgtmac unfortunately I can't even "just" use hadoop-client-api, because it is not functional by itself: it relies on shaded classes that are not actually included in the dependencies/classpath, so it fails at runtime.
Also, my issue is not a question: a feature is missing if I want to use LocalOutputFile with ParquetWriter.
Thanks for the clarification! I agree with you; I have run into the same issue. It seems that removing the Hadoop dependency is only partially implemented. I need more time to investigate this topic. If you have any idea how to resolve this, please feel free to open a PR.
Hi, please be advised that we do have some ParquetWriter implementations available out of the box, such as AvroParquetWriter. AFAIK there is no implementation that is fully decoupled from systems such as Avro. If you wish to avoid using any of these, you will unfortunately still have to create your own implementation.
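For example, writing through AvroParquetWriter to a LocalOutputFile looks roughly like this (a minimal sketch; the schema and file name are purely illustrative, and Hadoop classes are still needed at runtime, as discussed below):

```java
import java.nio.file.Paths;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.LocalOutputFile;

// Illustrative Avro schema; any schema works here.
Schema schema = SchemaBuilder.record("Example").fields()
    .requiredString("name")
    .requiredInt("value")
    .endRecord();

// LocalOutputFile wraps a plain java.nio.file.Path, so no HDFS Path is loaded.
try (ParquetWriter<GenericRecord> writer =
    AvroParquetWriter.<GenericRecord>builder(new LocalOutputFile(Paths.get("example.parquet")))
        .withSchema(schema)
        .build()) {
  GenericRecord record = new GenericData.Record(schema);
  record.put("name", "foo");
  record.put("value", 42);
  writer.write(record);
}
```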
Leveraging our LocalOutputFile implementation avoids loading the HDFS Path class, which used to be an issue in some production environments. To be able to fully drop the Hadoop runtime, changes to the way writer and reader utilities are configured and how data is (de)compressed were necessary, as these areas were (and still are) incredibly tightly coupled to Hadoop. We've made a good start on allowing for decoupled configuration through the ParquetConfiguration interface, though I believe the Hadoop Configuration class still needs to be loaded at one point.
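As an illustration of that interface (a sketch only — the exact entry points vary by version, and this assumes a release where the builder overloads accepting a ParquetConfiguration are available), a read path can be configured without Hadoop's Configuration class like so:

```java
import java.nio.file.Paths;

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.conf.PlainParquetConfiguration;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.LocalInputFile;

// PlainParquetConfiguration is a simple, Hadoop-free ParquetConfiguration implementation.
PlainParquetConfiguration conf = new PlainParquetConfiguration();

try (ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(
            new LocalInputFile(Paths.get("example.parquet")), conf)
        .build()) {
  GenericRecord record;
  while ((record = reader.read()) != null) {
    System.out.println(record);
  }
}
// Note: Hadoop artifacts still need to be on the runtime classpath for
// (de)compression and some configuration paths, for the reasons below.
```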
Once the outstanding problems with the configuration and the (de)compressors are resolved, it will be possible to build against only hadoop-client-api in the then-supported contexts (i.e., only for (de)compressors with available alternative implementations). Until we've reached that point, the project still has a runtime dependency on Hadoop. This means you will have to maintain dependencies on both the Hadoop client API and the Hadoop runtime.
Though this decoupling effort to move away from Hadoop is still very much ongoing, the decoupling of the (de)compression logic has proved rather difficult, with concerns around backwards compatibility complicating the situation further (and forcing users to keep a dependency on the Hadoop API even after all of this is resolved).