iceberg-rust Implement nan_value_counts && distinct

For the parquet writer, we are still missing the following fields in DataFile:

nan_value_counts
distinct_counts

Jun 23 '24 14:06 ZENOTME

I can take this up @liurenjie1024

Jul 09 '24 04:07 vaibhawvipul

I can take this up @liurenjie1024

Thanks!

Jul 09 '24 04:07 liurenjie1024

I can take this up @liurenjie1024

Welcome!

Jul 09 '24 04:07 Xuanwo

Just checking in @vaibhawvipul if you're still interested in adding this :)

Nov 27 '24 14:11 Fokko

Hey @Fokko ! 👋🏻

As the original author has not replied, I am interested in taking it up :)

A few points, regardless of who this gets assigned to:

I couldn't find distinct_counts in the Java or Python documentation. Am I misreading them? If they are present, can someone point me to them please? Also, from what I understand, distinct counts are available at the ColumnChunk level, but they cannot be aggregated at the DataFile level because the same values can appear in two different ColumnChunks. Am I understanding this correctly?
For NaN value counts, as the javadoc mentions:

Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers. This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing exceptions when they are accessed.

We will have to keep track of it ourselves, so I think we would go through each Field in each Column of the RecordBatch supplied here, find the float values, and then count the NaNs in them. Is this understanding correct?
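A minimal sketch of that counting approach, using only the standard library. The `Column` enum and `accumulate_nan_counts` are hypothetical names made up for illustration. They stand in for the Arrow arrays of a RecordBatch and are not part of iceberg-rust or arrow-rs; a real writer would inspect the batch's float arrays instead.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one Arrow column (assumption, not a real API).
/// `None` models a null value.
enum Column {
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
    Int64(Vec<Option<i64>>),
}

/// Adds the NaN count of every float column in `batch` to `counts`,
/// keyed by field id. Nulls are skipped and non-float columns are ignored,
/// so calling this once per written batch accumulates per-file totals.
fn accumulate_nan_counts(batch: &[(i32, Column)], counts: &mut HashMap<i32, u64>) {
    for (field_id, column) in batch {
        let nans = match column {
            Column::Float32(values) => values.iter().flatten().filter(|v| v.is_nan()).count(),
            Column::Float64(values) => values.iter().flatten().filter(|v| v.is_nan()).count(),
            Column::Int64(_) => continue,
        };
        *counts.entry(*field_id).or_insert(0) += nans as u64;
    }
}
```

Because the map is updated in place, the writer could keep one map per data file and fold each incoming batch into it before closing the file.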

Dec 01 '24 11:12 feniljain

Hi @feniljain, I also couldn't find how distinct counts are implemented in Java, but according to the spec it's supposed to be an estimated value using a sketch. I think we could start with nan values and ignore distinct counts first.

Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts
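To illustrate why the spec rules out merging existing distinct counts, here is a small sketch that models each column chunk as a plain `Vec<i64>`. Both function names are hypothetical and exist only for this example. It compares counting over all values in the file with summing per-chunk counts.

```rust
use std::collections::HashSet;

/// Exact distinct count derived from the values themselves,
/// across every chunk of one column in a file.
fn distinct_count(chunks: &[Vec<i64>]) -> u64 {
    let seen: HashSet<i64> = chunks.iter().flatten().copied().collect();
    seen.len() as u64
}

/// What merging would do: sum the distinct count of each chunk.
/// This overcounts any value that appears in more than one chunk.
fn summed_chunk_distinct_counts(chunks: &[Vec<i64>]) -> u64 {
    chunks
        .iter()
        .map(|chunk| chunk.iter().collect::<HashSet<_>>().len() as u64)
        .sum()
}
```

For chunks `[1, 2, 3]` and `[3, 4]`, counting over the values gives 4, while summing the per-chunk counts gives 5 because the value 3 is counted twice.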

Dec 11 '24 05:12 liurenjie1024

but according to the spec it's supposed to be an estimated value using a sketch

That sounds interesting, thanks for the link to the spec!

I think we could start with nan values and ignore distinct counts first.

Yup, let me work out a PR for nan_values first. Also, just confirming: is the method I described above correct for nan_values?

Dec 11 '24 06:12 feniljain

Yup, let me work out a PR for nan_values first. Also, just confirming: is the method I described above correct for nan_values?

Yes, exactly.

Dec 16 '24 02:12 liurenjie1024

Implement nan_value_counts && distinct_counts metrics in parquet writer