
Add SortingColumn on a file level

Open coastalwhite opened this issue 5 months ago • 3 comments

Describe the enhancement requested

Currently, sorting_columns is only defined on the RowGroupMetadata. This makes it hard to determine the sort order of the entire file when it has more than one row group.

As a result, the field is generally very hard to use during query optimization.

Is there any interest in adding an optional sorting_columns field to the FileMetadata? I feel like this is more generally useful than having it on the row group metadata. I would be happy to draft up a proposal.

coastalwhite avatar Jun 17 '25 09:06 coastalwhite

FWIW this was briefly discussed (https://github.com/apache/parquet-format/pull/242#discussion_r1603871178) during early discussions of revamping the metadata.

etseidl avatar Jun 17 '25 23:06 etseidl

I was honestly quite surprised to learn that this is how it worked. In the current state, you would basically have to read records to confirm that sorting holds between row groups, which would need to happen during planning.
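To make the planning problem concrete, here is a rough sketch of what a planner can and cannot prove from metadata alone. It is not tied to any real reader API: `RowGroupInfo`, its `lead_min`/`lead_max` fields (min/max statistics of the leading sort column), and `provable_file_sort` are hypothetical names for illustration. Only the `SortingColumn` fields mirror the format definition. The sketch also assumes the leading column is an integer column with no nulls and that statistics are present.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class SortingColumn:
    # Field names follow the SortingColumn struct in the Parquet format.
    column_idx: int
    descending: bool = False
    nulls_first: bool = False


@dataclass(frozen=True)
class RowGroupInfo:
    # Hypothetical stand-in for what a planner reads per row group.
    sorting_columns: Optional[Tuple[SortingColumn, ...]]
    lead_min: Optional[int]  # min statistic of the leading sort column
    lead_max: Optional[int]  # max statistic of the leading sort column


def provable_file_sort(
    row_groups: Sequence[RowGroupInfo],
) -> Optional[Tuple[SortingColumn, ...]]:
    """Return the sort order that metadata alone proves for the whole
    file, or None if proving it would require reading records."""
    if not row_groups:
        return None
    order = row_groups[0].sorting_columns
    if not order:
        return None
    # Every row group must declare the same sort order.
    if any(rg.sorting_columns != order for rg in row_groups):
        return None
    lead = order[0]
    # With several sort keys, a tie on the leading key at a row group
    # boundary can only be resolved by comparing the actual records.
    ties_ok = len(order) == 1
    for prev, nxt in zip(row_groups, row_groups[1:]):
        stats = (prev.lead_min, prev.lead_max, nxt.lead_min, nxt.lead_max)
        if any(s is None for s in stats):
            return None  # statistics are optional, so nothing is provable
        if lead.descending:
            last, first = prev.lead_min, nxt.lead_max
            in_order = last > first or (ties_ok and last == first)
        else:
            last, first = prev.lead_max, nxt.lead_min
            in_order = last < first or (ties_ok and last == first)
        if not in_order:
            return None
    return order
```

Even this best case depends on optional statistics and falls back to "unknown" whenever two row groups share a boundary value under a multi-column sort, which is where reading records becomes unavoidable.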

As far as I understand, this is not a backward-incompatible change. Should I file an RFC or a draft PR?

coastalwhite avatar Jun 18 '25 07:06 coastalwhite

You'll get more visibility on the dev mailing list. I'd suggest raising the issue there first.

etseidl avatar Jun 18 '25 13:06 etseidl