parquet-format
Add SortingColumn on a file level
Describe the enhancement requested
Currently, sorting_columns is defined only on the row group metadata. This makes it hard to determine whether the entire file is sorted when it has more than one row group.
As a result, the field is generally very hard to use during query optimization.
Is there any interest in adding an optional sorting_columns field to the file metadata? I feel this would be more generally useful than having it only on the row group metadata. I would be happy to draft a proposal.
FWIW this came up briefly (https://github.com/apache/parquet-format/pull/242#discussion_r1603871178) in the early discussions about revamping the metadata.
I was honestly quite surprised to learn that this is how it works. In the current state, you would basically have to read records to confirm that the sort order holds across row group boundaries, and that read would have to happen during planning.
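To make the gap concrete, here is a minimal sketch in plain Python. It does not use any Parquet library: `RowGroup`, `sorted_ascending`, and `file_is_sorted` are hypothetical stand-ins for a row group, its per-row-group sort flag, and a planner-side check. It only illustrates that "every row group is sorted" does not imply "the file is sorted".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with a single integer column."""
    rows: List[int]
    # Models what a per-row-group sorting_columns entry can tell a reader:
    # the rows inside this one group are sorted ascending.
    sorted_ascending: bool


def is_sorted(values: List[int]) -> bool:
    """Return True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def file_is_sorted(row_groups: List[RowGroup]) -> bool:
    """Check the order across the whole file by reading every value.

    This is the expensive step: it needs the data itself, not just
    the per-row-group flags.
    """
    all_rows = [value for group in row_groups for value in group.rows]
    return is_sorted(all_rows)


# Two row groups, each sorted internally, written in the "wrong" order.
groups = [
    RowGroup(rows=[5, 8, 9], sorted_ascending=True),
    RowGroup(rows=[1, 2, 3], sorted_ascending=True),
]

every_group_sorted = all(group.sorted_ascending for group in groups)
whole_file_sorted = file_is_sorted(groups)

print("every row group sorted:", every_group_sorted)
print("whole file sorted:", whole_file_sorted)
```

In this toy case the per-group flags are all set, yet the concatenated rows are out of order, so a planner relying only on row-group-level metadata could not safely assume a file-level ordering.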
As far as I understand, this would not be a backwards-incompatible change. Should I file an RFC or open a draft PR?
You'll get more visibility on the dev mailing list. I'd suggest raising the issue there first.