
DRAFT: Alternative Strawman proposal for a new V3 footer format in Parquet

Open emkornfield opened this issue 8 months ago • 10 comments

As a point of discussion, this is a slightly different take than https://github.com/apache/parquet-format/pull/242 on how column metadata could be decoupled from FileMetadata.

In particular this takes a slightly different approach:

  1. It introduces a new random-access encoding for Parquet to store serialized data, instead of relying on a one-off index scheme based on the Thrift structure. This approach gives implementations flexibility to further balance size vs. compute trade-offs, and it can potentially make use of any future encoding improvements. The two downsides of this approach are that it requires a little more work up front and has slightly more overhead than doing this directly with Thrift structures.
  2. It places the serialized data page completely outside of the Thrift metadata and instead provides an offset within the footer. This is mostly a micro-optimization (likely not critical) that lets Parquet implementors avoid unnecessary copies of string data when their Thrift library does not allow avoiding such copies. There is no reason that the pages could not be inlined as a "binary" field in FileMetadata, as is done in https://github.com/apache/parquet-format/pull/242.
  3. It moves a few other fields out of FileMetadata into a metadata page and raises discussion points on others.
  4. It re-uses existing Thrift objects in an attempt to make the transition easier for implementors.
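
To make points 1 and 2 concrete, here is a minimal Python sketch of the general idea, not the encoding proposed in this PR. It assumes a hypothetical page layout (a `u32` entry count, then `count + 1` little-endian `u32` offsets, then the concatenated payloads) and a hypothetical footer made of placeholder Thrift bytes followed by that page. The names `encode_entries`, `read_entry`, and `build_footer` are invented for illustration only.

```python
import struct


def encode_entries(entries):
    """Pack byte strings as: count (u32) | count + 1 offsets (u32) | payloads.

    Hypothetical layout for illustration; not the encoding from the proposal.
    """
    offsets = [0]
    for entry in entries:
        offsets.append(offsets[-1] + len(entry))
    header = struct.pack(f"<I{len(offsets)}I", len(entries), *offsets)
    return header + b"".join(entries)


def read_entry(page, index):
    """Return entry `index` as a memoryview slice, without copying the payload."""
    view = memoryview(page)
    (count,) = struct.unpack_from("<I", view, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # Only the two offsets bounding the requested entry are decoded.
    start, end = struct.unpack_from("<2I", view, 4 + 4 * index)
    data_start = 4 + 4 * (count + 1)
    return view[data_start + start:data_start + end]


def build_footer(column_metadata):
    """Return (footer_bytes, page_offset) with the page outside the Thrift bytes."""
    thrift_stub = b"<serialized FileMetadata placeholder>"  # assumption, not real Thrift
    page = encode_entries(column_metadata)
    return thrift_stub + page, len(thrift_stub)


# Each entry stands in for one column's serialized metadata.
columns = [b"col-a-meta", b"col-b-metadata", b"c"]
footer, page_offset = build_footer(columns)
page = memoryview(footer)[page_offset:]
second = read_entry(page, 1)  # touches only the header and two offsets
```

In this sketch a reader that needs one column's metadata decodes a fixed-size header and two offsets rather than every entry, and the returned `memoryview` refers to the original footer buffer instead of a copy.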

Things it does not do:

  1. Enumerate all fields that should be deprecated. https://github.com/apache/parquet-format/pull/242 is a good start, and a final list can be consolidated once a general approach is chosen.
  2. Incorporate the changes in https://github.com/apache/parquet-format/pull/248. These also likely make sense, but they can be incorporated into any final proposal.
  3. Make micro-optimizations that separate scan use cases from filter-evaluation use cases (the ColumnChunk structure could potentially be split further to give finer-grained access to elements that are needed in only one case or the other).

emkornfield • May 27 '24 20:05