GH-3242: Emit and Read min/max statistics for int96 timestamp columns
Rationale for this change
Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.
What changes are included in this PR?
- Defining column ordering for int96
- Implementation of comparator for int96
- A flag
parquet.read.int96stats.enabledto control if the stats are read. It is defaulted to true ValidInt96Stats: Reads stats from known good writers. Currently including: a.parquet-mr 1.15.0+b.photon- Fixing tests which fail due to this change
Are these changes tested?
- Added more unit tests for the new behaviour
Are there any user-facing changes?
toParquetStatisticsandfromParquetStatisticsare no longer static functions. It doesn't look like there is a good reason for these functions to be static.- Signatures of some of the constructors of
ParquetMetadataConverteris now different due to an added parameter.
Closes #3242
It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.
It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.
@emkornfield : I created the PR: https://github.com/apache/parquet-format/pull/503
After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?
@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue
@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue
Sorry I'm not sure on the API change. Lets get (lazy) consensus on the message for this direction. I think @rdblue had the strongest opinion on SortOrder
After this change,
toParquetStatisticsandfromParquetStatisticsinParquetMetadataConverterare no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?
@wgtmac : could you help with this?
After this change,
toParquetStatisticsandfromParquetStatisticsinParquetMetadataConverterare no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?@wgtmac : could you help with this?
@wgtmac : Could you help with this, or suggest who could help out? Thanks!
@rahulketch Sorry I'm too busy these days. Perhaps I can find some time next week.