parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

GH-3242: Emit and Read min/max statistics for int96 timestamp columns

Open rahulketch opened this issue 6 months ago • 8 comments

Rationale for this change

Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.

What changes are included in this PR?

  1. Defining column ordering for int96
  2. Implementation of comparator for int96
  3. A flag parquet.read.int96stats.enabled to control if the stats are read. It is defaulted to true
  4. ValidInt96Stats: Reads stats from known good writers. Currently including: a. parquet-mr 1.15.0+ b. photon
  5. Fixing tests which fail due to this change

Are these changes tested?

  1. Added more unit tests for the new behaviour

Are there any user-facing changes?

  1. toParquetStatistics and fromParquetStatistics are no longer static functions. It doesn't look like there is a good reason for these functions to be static.
  2. Signatures of some of the constructors of ParquetMetadataConverter is now different due to an added parameter.

Closes #3242

rahulketch avatar Jun 12 '25 10:06 rahulketch

It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.

emkornfield avatar Jun 13 '25 16:06 emkornfield

It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.

@emkornfield : I created the PR: https://github.com/apache/parquet-format/pull/503

rahulketch avatar Jun 25 '25 10:06 rahulketch

After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?

rahulketch avatar Jul 01 '25 11:07 rahulketch

@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue

rahulketch avatar Jul 23 '25 20:07 rahulketch

@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue

Sorry I'm not sure on the API change. Lets get (lazy) consensus on the message for this direction. I think @rdblue had the strongest opinion on SortOrder

emkornfield avatar Jul 23 '25 22:07 emkornfield

After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?

@wgtmac : could you help with this?

rahulketch avatar Jul 24 '25 12:07 rahulketch

After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?

@wgtmac : could you help with this?

@wgtmac : Could you help with this, or suggest who could help out? Thanks!

rahulketch avatar Sep 03 '25 21:09 rahulketch

@rahulketch Sorry I'm too busy these days. Perhaps I can find some time next week.

wgtmac avatar Sep 05 '25 09:09 wgtmac