Make it easy to read and write Parquet files in Java without depending on Hadoop
I am happy to help with this, but I'd love some guidance on:
- likelihood of being accepted as a patch.
- how critical it is to maintain backwards compatibility in APIs.
For instance, we probably want to introduce a new artifact that lives below the existing Hadoop-dependent artifact, move as much code as possible into it, and keep the Hadoop APIs in the old artifact.
I welcome comments on solving this issue.
Reporter: Oscar Boykin
Related issues:
- Make ParquetIO Read splittable (blocks)
- hadoop-common is not an optional dependency (is duplicated by)
- Avoid leaking Hadoop API to downstream libraries (incorporates)
- Add Java NIO Avro OutputFile InputFile (is related to)
PRs and other links:
Note: This issue was originally created as PARQUET-1126. Please see the migration documentation for further details.
Ryan Blue / @rdblue:
[~posco], I would really like to see Parquet create a non-Hadoop build. I just linked to another issue that adds all of the abstractions I know we will need for this. The next step is to separate everything out of the Hadoop package and create a new module with the format-specific code that relies only on these new interfaces. That shouldn't be too hard, but there is a lot of work to get compression codecs that don't pull in Hadoop. If there's a way to use the Hadoop compression libs without the rest, maybe that would be a good idea. Happy to have you help in this area; just ping me on PRs that you want reviewed. Thanks!
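For illustration, here is a minimal sketch of what those abstractions make possible: an `InputFile` backed by `java.nio` instead of a Hadoop `Path`. The class name `NioInputFile` is hypothetical, and at this point parquet-hadoop (and therefore the Hadoop jars) is still required on the runtime classpath; the sketch only shows the direction the new interfaces enable.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

/** Hypothetical InputFile over java.nio -- no org.apache.hadoop.fs.Path involved. */
public class NioInputFile implements InputFile {
  private final Path path;

  public NioInputFile(Path path) {
    this.path = path;
  }

  @Override
  public long getLength() throws IOException {
    return Files.size(path);
  }

  @Override
  public SeekableInputStream newStream() throws IOException {
    SeekableByteChannel channel = Files.newByteChannel(path);
    InputStream in = Channels.newInputStream(channel);
    // DelegatingSeekableInputStream only asks us to supply getPos()/seek();
    // the read/readFully variants are delegated to the wrapped stream.
    return new DelegatingSeekableInputStream(in) {
      @Override
      public long getPos() throws IOException {
        return channel.position();
      }

      @Override
      public void seek(long newPos) throws IOException {
        channel.position(newPos);
      }
    };
  }
}
```

Reading the footer then looks like this (still going through parquet-hadoop's `ParquetFileReader` today):

```java
// snippet; requires org.apache.parquet.hadoop.ParquetFileReader on the classpath
try (ParquetFileReader reader =
    ParquetFileReader.open(new NioInputFile(java.nio.file.Paths.get("data.parquet")))) {
  System.out.println(reader.getFooter().getFileMetaData().getSchema());
}
```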
Sam Halliday: In PARQUET-1953 I left some code that hides the Hadoop dependency from users of the API, but of course it still fails at runtime unless Hadoop is on the dependency list. Uses of Path and PathFilter seem to be what is leaking.
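To make that point concrete, the write side can likewise keep `org.apache.hadoop.fs.Path` out of user code, even though Hadoop classes (Configuration, compression codecs) are still loaded internally at runtime. A minimal sketch, assuming the `OutputFile` interface from parquet-common; the class name `NioOutputFile` is hypothetical.

```java
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.parquet.io.DelegatingPositionOutputStream;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

/** Hypothetical OutputFile over java.nio -- no Hadoop Path or FileSystem in the API. */
public class NioOutputFile implements OutputFile {
  private final Path path;

  public NioOutputFile(Path path) {
    this.path = path;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    return stream(FileChannel.open(path,
        StandardOpenOption.WRITE, StandardOpenOption.CREATE_NEW));
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return stream(FileChannel.open(path, StandardOpenOption.WRITE,
        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING));
  }

  @Override
  public boolean supportsBlockSize() {
    return false; // local files have no HDFS-style block size
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }

  private static PositionOutputStream stream(FileChannel channel) {
    // DelegatingPositionOutputStream forwards all writes; we only supply getPos().
    return new DelegatingPositionOutputStream(Channels.newOutputStream(channel)) {
      @Override
      public long getPos() throws IOException {
        return channel.position();
      }
    };
  }
}
```

Writers such as AvroParquetWriter expose builders that accept an OutputFile, so the Hadoop types disappear from the call site; the runtime failure described above comes from what those writers still load internally.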
David Mollitor / @belugabehr: Also check out some related work (waiting in a GitHub PR): PARQUET-1776
David Venable: Hello. We are using the Parquet project in our open-source project to read and write Parquet files. We have no particular need for Hadoop, aside from what Parquet pulls in. Is this something you are working toward? Do you have a known list of possible hindrances or blockers that would help us understand the effort involved?
Is this issue really closed? From reading https://github.com/apache/parquet-java/issues/2938, it sounds like it's still not easy to read and write Parquet files in Java without depending on Hadoop.
This issue (1497) feels like the most straightforward / canonical expression of this problem, and is linked to from places like https://stackoverflow.com/a/60067027/280852, so I think it would be valuable to re-open it to reflect the current status.
We have an implementation that supports this in the Trino codebase. See https://trino.io/blog/2025/02/10/old-file-system and https://github.com/trinodb/trino