
Make it easy to read and write Parquet files in Java without depending on Hadoop

Open asfimport opened this issue 8 years ago • 6 comments

I am happy to help with this, but I'd love some guidance on:

  1. the likelihood of a patch being accepted.
  2. how critical it is to maintain backwards compatibility in the APIs.

For instance, we probably want to introduce a new artifact underneath the existing Hadoop-dependent artifact, move as much code as possible into it, and keep the Hadoop APIs in the old artifact.

Comments on how to solve this issue are welcome.

Reporter: Oscar Boykin

Note: This issue was originally created as PARQUET-1126. Please see the migration documentation for further details.

asfimport avatar Oct 07 '17 01:10 asfimport

Ryan Blue / @rdblue: [~posco], I would really like to see Parquet create a non-Hadoop build. I just linked another issue that adds all of the abstractions I know we will need for this. The next step is to separate everything out of the Hadoop package and create a new module with the format-specific code that relies only on these new interfaces. That shouldn't be too hard, but there is a lot of work to get compression codecs that don't pull in Hadoop. If there's a way to use the Hadoop compression libs without the rest, maybe that would be a good idea. Happy to have you help in this area; just ping me on PRs that you want reviewed. Thanks!

asfimport avatar Nov 08 '17 19:11 asfimport
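For illustration, here is a minimal sketch of the Hadoop-free read path that the InputFile abstraction Ryan describes eventually enables. This is an assumption-laden example, not the API as it existed at the time of the comment: it assumes a recent parquet-java release that ships org.apache.parquet.io.LocalInputFile (added much later); older releases need a hand-written InputFile and may still load Hadoop classes at runtime for codecs and configuration.

```java
// Sketch: read Parquet footer metadata through the InputFile abstraction,
// with no Hadoop Path or FileSystem in user code. Assumes a parquet-java
// release that ships org.apache.parquet.io.LocalInputFile; on older
// releases Hadoop jars may still be required on the classpath at runtime.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.LocalInputFile;

public class ReadWithoutHadoopPath {
    public static void main(String[] args) throws IOException {
        InputFile file = new LocalInputFile(Paths.get("data.parquet"));
        try (ParquetFileReader reader = ParquetFileReader.open(file)) {
            // Footer metadata comes straight from the InputFile stream.
            System.out.println("rows: " + reader.getRecordCount());
            System.out.println("schema: "
                + reader.getFooter().getFileMetaData().getSchema());
        }
    }
}
```

Note that ParquetFileReader itself still lives in the org.apache.parquet.hadoop package, which illustrates how entangled the modules were.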

Sam Halliday: In PARQUET-1953 I left some code that hides the Hadoop dependency from users of the API, but of course it still fails at runtime unless Hadoop is on the classpath. Uses of Path and PathFilter seem to be what is leaking.

asfimport avatar Dec 15 '20 18:12 asfimport
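The write path has the same shape through OutputFile. Another hedged sketch, assuming org.apache.parquet.io.LocalOutputFile (again a much later addition) and the parquet-avro binding; the runtime leak Sam describes applies here too, since older releases still load Hadoop's Configuration under the hood even though it never appears in this code.

```java
// Sketch: write records through the OutputFile abstraction instead of a
// Hadoop Path. Assumes org.apache.parquet.io.LocalOutputFile and the
// parquet-avro module; Hadoop classes may still load at runtime on older
// releases, which is exactly the dependency leak discussed above.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.LocalOutputFile;

public class WriteWithoutHadoopPath {
    public static void main(String[] args) throws IOException {
        // A tiny Avro schema, for demonstration only.
        Schema schema = SchemaBuilder.record("Point").fields()
            .requiredInt("x").requiredInt("y").endRecord();

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(
                         new LocalOutputFile(Paths.get("points.parquet")))
                     .withSchema(schema)
                     .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("x", 1);
            rec.put("y", 2);
            writer.write(rec);
        }
    }
}
```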

David Mollitor / @belugabehr: Also check out some work that has already been done (waiting in a GitHub PR): PARQUET-1776

asfimport avatar Dec 15 '20 19:12 asfimport

David Venable: Hello. We are using the Parquet project in our open-source project to read and write Parquet files, and we have no particular need for Hadoop aside from what Parquet pulls in. Is removing the Hadoop dependency something you are working toward? Do you have a known list of possible hindrances or blockers, to help us understand the scope of the effort?

asfimport avatar Jun 07 '24 21:06 asfimport

Is this issue really closed? From reading https://github.com/apache/parquet-java/issues/2938, it sounds like it is still not easy to read and write Parquet files in Java without depending on Hadoop.

This issue (1497) feels like the most straightforward, canonical expression of the problem, and it is linked from places like https://stackoverflow.com/a/60067027/280852, so I think it would be valuable to reopen it to reflect the current status.

tarehart avatar Dec 16 '24 10:12 tarehart

We have an implementation that supports this in the Trino codebase. See https://trino.io/blog/2025/02/10/old-file-system and https://github.com/trinodb/trino.

mosabua avatar Feb 10 '25 22:02 mosabua