
PARQUET-1950: Define core features

Open gszadovszky opened this issue 4 years ago • 22 comments

Make sure you have checked all steps below.

Jira

  • [x] My PR addresses the following PARQUET-1950 issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
    • https://issues.apache.org/jira/browse/PARQUET-XXX
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Commits

  • [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • [x] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

The whole document is up for discussion, but the parts marked with a ? or TODO are the ones where I don't have a strong opinion. Feel free to add any comments about content or wording.

gszadovszky avatar Dec 11 '20 17:12 gszadovszky

@julienledem, @rdblue, could you please add your notes/ideas about this?

gszadovszky avatar Dec 14 '20 11:12 gszadovszky

For the record, we started documenting the features supported by parquet-cpp (which is part of the Arrow codebase).

pitrou avatar Dec 16 '20 14:12 pitrou

> For the record, we started documenting the features supported by parquet-cpp (which is part of the Arrow codebase).

I've opened https://issues.apache.org/jira/browse/ARROW-11181 so we can do the same for the Rust implementation

nevi-me avatar Jan 08 '21 08:01 nevi-me

It would be nice to cover the status of external file column chunk (https://github.com/apache/arrow/pull/8130 was opened for C++)

emkornfield avatar Jan 31 '21 00:01 emkornfield

@emkornfield, in parquet-mr there was another reason to use the file_path in the footer. The feature is called summary files. The idea was to have a separate file containing a summarized footer of several parquet files, so you could do filtering and pruning without even checking a file's own footer. As far as I know this implementation exists in parquet-mr only, and there is no specification for it in parquet-format. This feature is more or less abandoned, meaning that during the development of some newer features (e.g. column indexes, bloom filters) the related parts might not have been updated properly. There were a couple of discussions about this topic on the dev list: here and here.
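To make the summary-file idea concrete, here is a hedged sketch of the concept only: a separate structure aggregates per-file column statistics so a reader can prune whole files without opening their footers. The dict layout below is hypothetical and is not the parquet-mr on-disk format.

```python
# Hypothetical sketch of the "summary file" concept: aggregate per-file
# min/max statistics so readers can prune files without reading each
# file's own footer. Illustrative only; NOT the parquet-mr format.

def build_summary(footers):
    """footers: {file_path: {column: (min, max)}} -> summary structure."""
    return {path: dict(stats) for path, stats in footers.items()}

def prune_files(summary, column, value):
    """Return only the files whose [min, max] range may contain `value`."""
    keep = []
    for path, stats in summary.items():
        lo, hi = stats[column]
        if lo <= value <= hi:
            keep.append(path)
    return keep

summary = build_summary({
    "part-0.parquet": {"ts": (0, 99)},
    "part-1.parquet": {"ts": (100, 199)},
})
print(prune_files(summary, "ts", 150))  # only part-1 can contain ts=150
```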

Because neither the idea of external column chunks nor the summary files spread across the different implementations (because of the lack of a specification), I think we should not include the usage of the field file_path in this document, or should even explicitly specify that this field is not supported.

I am open to specifying such features properly, and after the required demonstration we may include them in a later version of the core features. However, I think these requirements (e.g. snapshot API, summary files) are not necessarily needed by all of our clients, or are already implemented in some other way (e.g. storing statistics in HMS, Iceberg).

gszadovszky avatar Feb 01 '21 09:02 gszadovszky

> Because non of the ideas of external column chunks nor the summary files were spread across the different implementations (because of the lack of specification) I think we should not include the usage of the field file_path in this document or even explicitly specify that this field is not supported.

Being explicit seems reasonable to me if others are OK with it.

emkornfield avatar Feb 01 '21 20:02 emkornfield

@gszadovszky and @emkornfield it's highly coincidental that I was just looking into cleaning up apache/arrow#8130 when I noticed this thread. External column chunk support is one of the key features that attracted me to Parquet in the first place, and I would like the chance to lobby for keeping it and actually expanding its adoption. I already have the complete PR mentioned above, and I can help with supporting it across other implementations. There are a few major domains where I see this as a valuable component:

  1. Allowing concurrent reads of fully flushed row groups while the parquet file is still being appended to. A slight variant of this is allowing subsequent row group appends to a parquet file without impacting potential readers.
  2. Being able to aggregate multiple data sets in a master parquet file: one scenario is cumulative recordings, like stock prices that get collected daily and need to be presented as one unified historical file; another is enrichment, where we want to add new columns to an existing data set.
  3. Allowing for bi-temporal changes to a parquet file: external column chunks allow one to apply small corrections by simply creating delta files and new footers that swap out the chunks that require changes and point to the new ones.

If the above use cases are addressed by other parquet overlays, or they don't line up with the intended usage of parquet, I can look elsewhere, but it seems like a huge opportunity, and the development costs for supporting it are quite minor by comparison.
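Use case 3 above can be sketched in a few lines. The dicts below are hypothetical stand-ins for Thrift ColumnChunk metadata; only the `file_path` indirection mirrors the actual parquet-format field.

```python
# Hypothetical sketch of use case 3 (bi-temporal corrections): write a
# delta file containing only the corrected column chunk, then emit a new
# footer whose chunk pointer for that column targets the delta file.
# These dicts stand in for Thrift ColumnChunk records; only `file_path`
# corresponds to a real parquet-format field.

def apply_delta(footer, column, delta_path):
    """Return a new footer whose chunk for `column` points at `delta_path`."""
    new_footer = {col: dict(chunk) for col, chunk in footer.items()}
    new_footer[column]["file_path"] = delta_path
    return new_footer

original = {
    "price": {"file_path": "data-0.parquet", "offset": 4},
    "ts":    {"file_path": "data-0.parquet", "offset": 4096},
}
# The original file and footer stay untouched; readers of the old footer
# are unaffected, while the new footer sees the corrected chunk.
corrected = apply_delta(original, "price", "delta-1.parquet")
```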

raduteo avatar Feb 01 '21 23:02 raduteo

@raduteo the main driver for this PR is that there has been a lot of confusion as to what is defined as needing core support. Once we finish this PR I'm not fully opposed to the idea of supporting this field, but I think we need to go into greater detail in the specification about what supporting the individual files actually means (and I think being willing to help both Java and C++ support it can go a long way toward convincing people that it should become a core feature).

emkornfield avatar Feb 02 '21 03:02 emkornfield

+1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is supported widely in practice. There are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementations. I think it should ultimately make it easier to get adoption on new features because the status of each feature will be clearer.

timarmstrong avatar Feb 02 '21 04:02 timarmstrong

To add the parquet encryption angle to this discussion: this feature adds protection of the confidentiality and integrity of parquet files (when they have columns with sensitive data). These security layers will make it difficult to support many of the legacy features mentioned above, like external chunks or merging multiple files into a single master file (this interferes with the definition of file integrity). Reading encrypted data is also difficult before file writing is finished. None of these are impossible, but they are challenging, and would require explicit scaffolding plus some Thrift format changes. If there is strong demand for using encryption with these legacy features, despite them being deprecated (or with some of the mentioned new features), we can plan this for future versions of parquet-format, parquet-mr, etc.

ggershinsky avatar Feb 02 '21 07:02 ggershinsky

@ggershinsky, I think it is totally fine to say that encryption does not support external chunks and similar features even if they were fully supported by the implementations.

BTW, as we are already talking about encryption: I did not plan to include this feature here for now. I think it is not mature enough yet, and it is also not something that every implementation requires. It might be a good candidate for a later release of core features.

gszadovszky avatar Feb 02 '21 08:02 gszadovszky

@gszadovszky I certainly agree the encryption feature is not ready yet to be on this list. According to the definition, we need to "have at least two different implementations that are released and widely tested". While we already have parquet-mr and parquet-cpp implementations, their release and testing status is not yet at that point. We can revisit this for a later version of the CoreFeatures.md.

ggershinsky avatar Feb 02 '21 08:02 ggershinsky

> +1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is supported widely in practice - there are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementations. I think it should ultimately make it easier to get adoption on new features cause the status of each feature will be clearer.

Thank you @emkornfield and @timarmstrong for the clarifications! Btw, I am 100% in favor of the current initiative. I can relate to the world of pain one has to go through navigating parquet incompatibilities, and I can definitely see how this can mitigate those issues while allowing the standard and the underlying implementations to evolve.

raduteo avatar Feb 03 '21 00:02 raduteo

@gszadovszky Are you still looking forward to merging this PR?

shangxinli avatar Feb 07 '23 19:02 shangxinli

@shangxinli, I don't have enough time to work on it, and I have also lost my contacts at Impala. This should be a cross-community effort. It would be nice if someone could take it over, but if there is not much interest it might not be worth the effort.

gszadovszky avatar Feb 08 '23 08:02 gszadovszky

What isn't clear to me is how we make sure we've reached all communities. To enumerate the Parquet implementations I know about:

  • parquet-mr
  • impala
  • Rust (Arrow and Arrow2)
  • C++ (parquet-cpp/arrow)
  • Python (parquet-cpp/pyarrow and fastparquet)

What do we believe is needed to move forward?

emkornfield avatar Feb 13 '23 07:02 emkornfield

I don't know of other implementations either. Since parquet-format is managed by this community, I would expect the "implementors" to listen to the dev mailing list at least. I believe there is a Twitter channel for parquet managed by @julienledem; it might also help to reach parquet implementor communities. To move forward, someone should pick up all the discussions in this PR and agree on solutions. There might be TODOs left in it as well. Unfortunately, I don't have the time to work on it anymore.

gszadovszky avatar Feb 15 '23 13:02 gszadovszky

Wild idea: instead of defining core features, how about rephrasing this in terms of presets?

We could have a growing number of calendar-versioned presets, example:

  • Preset 2023.06 : v2 data pages + delta encodings + ZSTD + Snappy + ZLib (+ logical types etc.)
  • Preset 2024.11 : the former + byte stream split encoding + LZ4_RAW
  • ...

I'm also skeptical that this needs to be advertised in the Thrift metadata. Presets would mostly serve as a guideline for implementations and as an API simplification for users.
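The preset idea can be expressed purely as data. A hedged sketch (the preset names follow the calendar scheme above, but the feature labels are illustrative, not part of any spec):

```python
# Hypothetical sketch of calendar-versioned presets as feature sets.
# Feature labels are illustrative, not taken from parquet-format.

PRESETS = {
    "2023.06": frozenset({
        "DATA_PAGE_V2", "DELTA_BINARY_PACKED", "ZSTD", "SNAPPY", "ZLIB",
    }),
}
# Each later preset is a superset of the previous one (monotonic growth),
# so "reads preset 2024.11" implies "reads preset 2023.06" as well.
PRESETS["2024.11"] = PRESETS["2023.06"] | {"BYTE_STREAM_SPLIT", "LZ4_RAW"}
```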

(edit: could name this "profiles" instead of "presets" if that's more familiar to users)

pitrou avatar May 31 '23 14:05 pitrou

We could have a growing number of calendar-versioned presets, example:

@pitrou I'm not really familiar with the term presets or profiles. What are the implications across parquet-mr and parquet-cpp for them? How would they be used?

emkornfield avatar Jun 03 '23 18:06 emkornfield

A preset would be a set of features.

On the read side, each implementation would document the presets it's compatible with (meaning they are able to read all the features in the preset).

On the write side, the user would be able to ask for a given preset when writing a Parquet file (potentially using any feature in the preset, but not features outside of it).
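Both sides can be sketched as simple set checks (hypothetical helper names and feature labels; this is an illustration of the semantics described above, not a proposed API):

```python
# Hypothetical sketch of preset semantics: the read side checks coverage,
# the write side rejects features outside the requested preset.

PRESET_2023_06 = frozenset({"DATA_PAGE_V2", "ZSTD", "SNAPPY"})

def reader_supports(preset, reader_features):
    """Read side: compatible iff the reader covers every preset feature."""
    return preset <= frozenset(reader_features)

def check_writer_features(preset, requested_features):
    """Write side: reject any requested feature outside the preset."""
    extra = frozenset(requested_features) - preset
    if extra:
        raise ValueError(f"features outside preset: {sorted(extra)}")
```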

pitrou avatar Jun 03 '23 20:06 pitrou

For an analogy, you can think of RISC-V profiles or MPEG profiles, but I would suggest Parquet profiles (or presets) to be monotonically increasing for ease of use (i.e. each profile is a superset of any earlier profile).

pitrou avatar Jun 03 '23 20:06 pitrou

I've been doing more digging into parquet-format and versioning of the spec has definitely been one of the more confusing pieces. I'm glad that there is an effort to define "core" features, or feature presets.

I made a simple tool that just reads Parquet files and produces a boolean checklist of features that are being used in each file: https://github.com/Eventual-Inc/parquet-benchmarking. Running a tool like this on "data in the wild" has been our approach so far for understanding what features are actively being used by the main Parquet producers (Arrow, Spark, Trino, Impala etc).
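The core of such a checklist can be sketched in a few lines against the shape of pyarrow's FileMetaData (duck-typed here so the sketch stands alone; with pyarrow installed you would pass `pyarrow.parquet.ParquetFile(path).metadata`):

```python
# Sketch of a per-file feature checklist. It walks any object shaped like
# pyarrow's FileMetaData (num_row_groups, row_group(i), num_columns,
# column(j), .compression, .encodings) and collects the codecs and
# encodings actually used in the file.

def summarize_features(md):
    compressions, encodings = set(), set()
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        for j in range(rg.num_columns):
            col = rg.column(j)
            compressions.add(col.compression)
            encodings.update(col.encodings)
    return {"compressions": compressions, "encodings": encodings}

# With pyarrow installed:
#   import pyarrow.parquet as pq
#   summarize_features(pq.ParquetFile("data.parquet").metadata)
```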

It could be useful for developers of Parquet-producing frameworks to start producing Parquet read/write feature compatibility documentation. As a community we can then start to understand which features have been actively adopted and which ones should be considered more experimental or specific to a given framework!


A cool idea might be to have a parquet-features repository where I can contribute Daft (our dataframe framework) code which:

  1. Reads canonical Parquet files that uses certain feature presets
  2. Writes Parquet files to use certain feature presets

The repository can then have automated CI to publish reports on each implementation's supported features.

jaychia avatar Jul 11 '23 20:07 jaychia