DataFusion ignores "column order" parquet statistics specification

Open: alamb opened this issue 1 year ago • 0 comments

Describe the bug

As @tustvold points out, there is a `column_order` API defined in parquet that DataFusion currently ignores entirely.

It is not clear to me what the implications of ignoring this field are, or what other parquet writers populate it with, but we should probably not ignore it.

To Reproduce

No response

Expected behavior

No response

Additional context

To emphasise the point I made when this API was originally proposed, you need more than just the ParquetStatistics to correctly interpret the data. You need at least the FileMetaData, so that you can get the column_order (https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order), before you can even interpret what the statistics mean for a given column.
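To illustrate why the ordering matters, here is a minimal, std-only sketch. The `SortOrder` enum and the `compare_stat` and `can_prune_gt` helpers are hypothetical stand-ins written for this example; they are not the parquet crate's or DataFusion's actual types or pruning logic. The sketch assumes a 4-byte little-endian statistic and shows that the same stored bytes can lead to opposite pruning decisions depending on which order is assumed.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for the sort order that a column order implies.
/// This is NOT the parquet crate's type, only an illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SortOrder {
    Signed,
    Unsigned,
    Undefined,
}

/// Compare two 4-byte little-endian statistic values under the given order.
/// Returns `None` when the order is undefined, since the bytes alone do not
/// say how the writer compared values.
fn compare_stat(a: [u8; 4], b: [u8; 4], order: SortOrder) -> Option<Ordering> {
    match order {
        SortOrder::Signed => Some(i32::from_le_bytes(a).cmp(&i32::from_le_bytes(b))),
        SortOrder::Unsigned => Some(u32::from_le_bytes(a).cmp(&u32::from_le_bytes(b))),
        SortOrder::Undefined => None,
    }
}

/// Decide whether a row group can be skipped for the predicate `col > value`,
/// given the row group's `max` statistic. Pruning is only safe when the order
/// is known and `max <= value` under that order.
fn can_prune_gt(max: [u8; 4], value: [u8; 4], order: SortOrder) -> bool {
    matches!(
        compare_stat(max, value, order),
        Some(Ordering::Less | Ordering::Equal)
    )
}
```

With `max = [0xFF; 4]` and `value = 1u32.to_le_bytes()`, the signed interpretation reads the max as -1 and allows pruning, while the unsigned interpretation reads it as 4294967295 and does not. An undefined order never allows pruning in this sketch, which is the conservative choice when the statistics cannot be interpreted.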

alamb, May 20 '24 18:05