GH-1452: implement Size() filter for repeated columns
Rationale for this change
This PR continues the work outlined in #1452. It implements a `size()` predicate for filtering on the number of elements in repeated fields:

```java
FilterPredicate hasThreeElements = size(intColumn("my_list_field"), Operators.Size.Operator.EQ, 3);
```
What changes are included in this PR?
`size()` and `not(size())` are implemented for all list fields with a required element type. Attempting to filter on a list of optional elements throws an exception in the schema validator. This is because the existing record-level filtering setup (`IncrementallyUpdatedFilterPredicateEvaluator`) only feeds non-null values to the `ValueInspector`s; thus, for an array `[1, 2, null, 4]`, it would only count 3 elements. I can file a follow-up ticket to support this eventually, but I think we'd have to rework the `FilteringRecordMaterializer` to be aware of repetition/definition levels.
The list group itself can be optional or required. Null lists are treated as having size 0. Again, this is due to the difficulty of disambiguating them at the record-level filtering step. (Would love feedback on both of these design decisions!)
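To make the two semantics above concrete, here is a minimal standalone sketch of what the record-level evaluator effectively observes. The class and method names are made up for illustration and are not part of the Parquet API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the size() semantics described above:
// null elements are never fed to the ValueInspectors, and a null
// (or missing) list is treated as having size 0.
public class SizeSemanticsSketch {

    // Counts only the values the record-level evaluator would ever see.
    static int observedSize(List<Integer> list) {
        if (list == null) {
            return 0; // null list treated as empty
        }
        int count = 0;
        for (Integer v : list) {
            if (v != null) {
                count++; // null elements are invisible at this layer
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(observedSize(Arrays.asList(1, 2, null, 4))); // 3, not 4
        System.out.println(observedSize(null));                         // 0
    }
}
```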
Are these changes tested?
Unit tests + tested a snapshot build locally with real datasets
Are there any user-facing changes?
New Operators API
Part of #1452
Thanks for adding this! This is a large PR that I need to take some time to review.
It would be good if @emkornfield @gszadovszky could take a look to see if this is a good use case for SizeStatistics.
> Thanks for adding this! This is a large PR that I need to take some time to review.
Thanks, no rush on reviewing it! 👍
I can try to look in more detail, but stats can certainly be used here. I imagine they are most useful for repeated fields when trying to discriminate between repeated fields that mostly have 0 or 1 element, and trying to filter out cases with > 0 or 1 elements. E.g. if all fields have 0 observed rep_levels of 1, then one knows for sure all lists are of length 0 or 1 (whether there are any lists of length 0 or 1 can be determined by inspecting the def-level histogram). For larger-cardinality lists the filtering power diminishes significantly (it's hard to distinguish, based on histograms, between many very small lists and one very large one).
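A rough standalone sketch of the inference described above, assuming a simplified repetition-level histogram where `histogram[r]` is the number of values observed at rep_level `r` (the names and layout here are illustrative, not the actual SizeStatistics API):

```java
// Simplified sketch: for a single-level list, rep_level 1 marks a value
// that continues an existing list. If no values were observed at any
// rep_level > 0, every list in the chunk must have length 0 or 1, so a
// predicate like size() > 1 can be ruled out for the whole chunk without
// reading the data.
public class RepLevelSketch {

    static boolean allListsAtMostOne(long[] repLevelHistogram) {
        for (int r = 1; r < repLevelHistogram.length; r++) {
            if (repLevelHistogram[r] > 0) {
                return false; // some list has at least 2 elements
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 100 list starts, 0 continuations: no list can have > 1 element.
        System.out.println(allListsAtMostOne(new long[] {100, 0}));  // true
        // 100 list starts, 40 continuations: some list has >= 2 elements.
        System.out.println(allListsAtMostOne(new long[] {100, 40})); // false
    }
}
```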
Thanks for the effort! I just took an initial pass on it and left a couple of questions.
Thanks for the review!! I should have time to address everything early next week at the latest 👍
BTW, the level histogram might not be available when max_level is 0, because there is only a single level (i.e. 0) and its count can be deduced from num_values of the column chunk or page. This will complicate the size filter here.
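For what it's worth, the fallback could look something like this sketch (names are made up for illustration, not the real API):

```java
// Illustrative sketch of the fallback: when max_level == 0 there is only
// one possible level, so the histogram may be omitted and the count at
// level 0 can be deduced from the page's/chunk's num_values.
public class LevelHistogramFallback {

    static long[] levelHistogramOrDeduced(long[] histogram, int maxLevel, long numValues) {
        if (histogram != null) {
            return histogram; // histogram was written out; use it directly
        }
        if (maxLevel == 0) {
            // Only level 0 exists; every value sits at it.
            return new long[] { numValues };
        }
        return null; // genuinely unavailable; the size filter must bail out
    }

    public static void main(String[] args) {
        long[] deduced = levelHistogramOrDeduced(null, 0, 1234L);
        System.out.println(deduced[0]); // 1234
    }
}
```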