Implement nan_value_counts && distinct_counts metrics in parquet writer

Open ZENOTME opened this issue 1 year ago • 3 comments

For the parquet writer, we are still missing the following fields in DataFile:

  • nan_value_counts
  • distinct_counts

ZENOTME avatar Jun 23 '24 14:06 ZENOTME

I can take this up @liurenjie1024

vaibhawvipul avatar Jul 09 '24 04:07 vaibhawvipul

I can take this up @liurenjie1024

Thanks!

liurenjie1024 avatar Jul 09 '24 04:07 liurenjie1024

I can take this up @liurenjie1024

Welcome!

Xuanwo avatar Jul 09 '24 04:07 Xuanwo

Just checking in @vaibhawvipul if you're still interested in adding this :)

Fokko avatar Nov 27 '24 14:11 Fokko

Hey @Fokko ! 👋🏻

As the original author has not replied, I am interested in taking it up :)

A few points, regardless of who this gets assigned to:

  • I couldn't find distinct_counts in the Java or Python documentation. Am I misreading them? If they are present, can someone point me to them please? Also, from what I understand, distinct counts are present at the ColumnChunk level, but it would not be possible to aggregate them at the DataFile level because the same values can appear in two different ColumnChunks. Am I understanding this correctly?
  • For NaN value counts, as the javadoc mentions:
Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers. This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing exceptions when they are accessed.

We will have to keep track of it on our own, so I think we would go through each field in each column of the RecordBatch passed to the writer, find the float values, and then count the NaNs in them. Is this understanding correct?
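A minimal sketch of that counting approach, using only the standard library. The `Column` enum and `NanValueCounter` type are hypothetical stand-ins (not iceberg-rust or arrow APIs); a real writer would inspect the arrow arrays in each RecordBatch instead. It assumes counts are keyed by field id and accumulated across batches.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one typed column of a record batch.
/// `None` models a null value.
enum Column {
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
    Int64(Vec<Option<i64>>),
}

/// Hypothetical accumulator: NaN counts per field id, summed over all batches.
#[derive(Default)]
struct NanValueCounter {
    counts: HashMap<i32, u64>,
}

impl NanValueCounter {
    /// Visit every column of a batch; only float columns can contain NaN.
    fn update(&mut self, batch: &[(i32, Column)]) {
        for (field_id, column) in batch {
            let nans = match column {
                // `flatten` skips nulls, so they are never counted as NaN.
                Column::Float32(values) => {
                    values.iter().flatten().filter(|v| v.is_nan()).count()
                }
                Column::Float64(values) => {
                    values.iter().flatten().filter(|v| v.is_nan()).count()
                }
                // Non-float columns get no entry at all.
                Column::Int64(_) => continue,
            };
            *self.counts.entry(*field_id).or_insert(0) += nans as u64;
        }
    }
}
```

Float columns without any NaN still get an entry with a count of 0 in this sketch, while non-float columns are left out of the map; whether the real writer should do the same is an open design choice.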

feniljain avatar Dec 01 '24 11:12 feniljain

Hi @feniljain, I also couldn't find how distinct counts are implemented in Java, but according to the spec it's supposed to be an estimated value using sketch. I think we could start with nan values and ignore distinct counts first.

Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts
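To illustrate why existing distinct counts cannot simply be merged, here is a small sketch with made-up data. Both function names are hypothetical and chunks are modelled as plain `Vec<i64>`; it uses exact counting with a `HashSet` rather than a sketch.

```rust
use std::collections::HashSet;

/// Naive merge: sum the distinct count of each chunk.
/// Overcounts whenever the same value appears in more than one chunk.
fn sum_of_chunk_distinct_counts(chunks: &[Vec<i64>]) -> usize {
    chunks
        .iter()
        .map(|chunk| chunk.iter().collect::<HashSet<_>>().len())
        .sum()
}

/// Derive the count from the values themselves, across all chunks.
fn distinct_count_from_values(chunks: &[Vec<i64>]) -> usize {
    chunks.iter().flatten().collect::<HashSet<_>>().len()
}
```

For chunks `[1, 2, 3]` and `[3, 4]`, the naive merge yields 5 while counting over the values yields 4, because the value 3 appears in both chunks.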

liurenjie1024 avatar Dec 11 '24 05:12 liurenjie1024

but according to the spec it's supposed to be an estimated value using sketch

That sounds interesting, thanks for the link to the spec!

I think we could start with nan values and ignore distinct counts first.

Yup, let me work out a PR for nan_values first, also just confirming is the method mentioned by me up above correct for nan_values?

feniljain avatar Dec 11 '24 06:12 feniljain

Yup, let me work out a PR for nan_values first, also just confirming is the method mentioned by me up above correct for nan_values?

Yes, exactly.

liurenjie1024 avatar Dec 16 '24 02:12 liurenjie1024