Implement nan_value_counts && distinct_counts metrics in parquet writer
For the parquet writer, we are still missing the following fields in DataFile:
- nan_value_counts
- distinct_counts
I can take this up @liurenjie1024
Thanks!
Welcome!
Just checking in @vaibhawvipul if you're still interested in adding this :)
Hey @Fokko ! 👋🏻
As the original author has not replied, I am interested in taking it up :)
A few points, regardless of who this gets assigned to:
- I couldn't see distinct_counts in the Java or Python documentation. Am I reading them wrong somewhere? If they are present, can someone point me to them please? Also, from what I understand, distinct counts are present at the ColumnChunk level, but it would not be possible to aggregate them at the DataFile level because fields can be the same between two different ColumnChunks. Am I understanding this correctly?
- For NaN value counts, as the javadoc mentions:
Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers. This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing exceptions when they are accessed.
We will have to keep track of it on our own, so I think we would go through each Field in each Column of the RecordBatch supplied here, find the float values, and count the NaNs in them. Is this understanding correct?
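To make the proposal concrete, here is a minimal sketch of the NaN counting described above, using only the Rust standard library. The `Column` enum, the `NanCounter` struct, and the use of `i32` field ids as map keys are hypothetical stand-ins invented for illustration; they are not the Arrow `RecordBatch` API or the writer's actual types. Nulls are skipped and non-float columns get no entry.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one column of a record batch (not the Arrow API).
enum Column {
    Float64(Vec<Option<f64>>),
    Float32(Vec<Option<f32>>),
    Int64(Vec<Option<i64>>),
}

/// Counts NaNs in one column, skipping nulls.
/// Returns `None` for non-float columns, which cannot hold NaN.
fn nan_count(column: &Column) -> Option<u64> {
    match column {
        Column::Float64(values) => {
            Some(values.iter().flatten().filter(|v| v.is_nan()).count() as u64)
        }
        Column::Float32(values) => {
            Some(values.iter().flatten().filter(|v| v.is_nan()).count() as u64)
        }
        Column::Int64(_) => None,
    }
}

/// Accumulates NaN counts per field id across all batches written to one file.
#[derive(Default)]
struct NanCounter {
    counts: HashMap<i32, u64>,
}

impl NanCounter {
    /// Folds one batch, given as (field id, column) pairs, into the running totals.
    fn update(&mut self, batch: &[(i32, Column)]) {
        for (field_id, column) in batch {
            if let Some(n) = nan_count(column) {
                *self.counts.entry(*field_id).or_insert(0) += n;
            }
        }
    }
}
```

Under these assumptions, the writer would call `update` once per batch and copy `counts` into the data file metadata when the file is closed.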
Hi @feniljain, I also didn't find how distinct counts are implemented in Java, but according to the spec it's supposed to be an estimated value using sketch. I think we could start with nan values and ignore distinct counts first.
Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts
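The restriction on merging can be illustrated with a small sketch, again using only the Rust standard library. The function names and the use of plain `i64` vectors as column chunks are assumptions made for illustration. Summing per-chunk distinct counts overcounts whenever a value appears in more than one chunk, while counting from the values themselves does not.

```rust
use std::collections::HashSet;

/// Exact number of distinct values within a single chunk.
fn distinct_count(chunk: &[i64]) -> usize {
    chunk.iter().collect::<HashSet<_>>().len()
}

/// Incorrect approach: merge existing per-chunk distinct counts by summing them.
fn merged_by_sum(chunks: &[Vec<i64>]) -> usize {
    chunks.iter().map(|c| distinct_count(c)).sum()
}

/// Approach consistent with the spec text: derive the count from the values in the file.
fn counted_from_values(chunks: &[Vec<i64>]) -> usize {
    chunks.iter().flatten().collect::<HashSet<_>>().len()
}
```

For chunks `[1, 2, 3]` and `[3, 4]`, the summed result is 5 but the file holds only 4 distinct values, because 3 appears in both chunks. An exact set like this can grow large for high-cardinality columns, which is presumably why the spec also allows sketches.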
but according to the spec it's supposed to be an estimated value using sketch
That sounds interesting, thanks for the link to the spec!
I think we could start with nan values and ignore distinct counts first.
Yup, let me work out a PR for nan_values first. Also, just confirming: is the method I described above correct for nan_values?
Yes, exactly.