Add support for repeated columns in the filter2 API
They currently are not supported. They would need their own set of operators, like contains() and size() etc.
Reporter: Alex Levenson / @isnotinvain
PRs and other links:
Note: This issue was originally created as PARQUET-34. Please see the migration documentation for further details.
Viktor Szathmáry / @phraktle: Any plans on supporting this?
Alex Levenson / @isnotinvain: We'd like to, contains() at least shouldn't be too hard, it may not even require any changes except removing the check that prevents it.
That said we don't have any concrete plans to do it right away – pull requests always welcome and we're happy to help!
Flavio Pompermaier: Repeated columns (protobuf) and lists (thrift) support would be VERY useful..
Stefano Bortoli: I agree with Flavio. +1 for supporting multi-valued columns.
Nicola: +1
Julien Le Dem / @julienledem: @alexlevenson did you have any plan for this?
Flavio Pompermaier: If someone could give me some advice I could try to implement it by myself (I never looked into Parquet code...)
Julien Le Dem / @julienledem: You can start by looking at the filter2 api contribution: commit: https://github.com/apache/parquet-mr/commit/ad32bf0fd111ab473ad1080cde11de39e3c5a67f PR: https://github.com/apache/parquet-mr/pull/4 older PR with feedback: https://github.com/Parquet/parquet-mr/pull/412
Flavio Pompermaier: I think that introducing contains() is definitely not as simple as mentioned at the beginning of this post. It would require a huge rework of parquet-column in order to handle Collections (PARQUET-35) and filtering on row groups (see PARQUET-43).
I'll be glad to contribute but I can't do it alone without a direct support from your side (at least to agree on the changes to apport at the core code). Do you think you can spend some time to actively focus with me on this task in the next days...?
Flavio Pompermaier: Anyone willing to push this feature..? :(
Ryan Blue / @rdblue: [~f.pompermaier], I don't think anyone has extra cycles to spend implementing this right now, but if you are interested in building it, we'll work with you to get it reviewed and included.
I think the next step is to write up what you think needs to be done so we can look at it and help you in the right direction. It may be that the disconnect between Alex's comment about this being easy and your assessment is a different level of support.
Flavio Pompermaier: Am I still the only one asking for this feature..?
Ryan Blue / @rdblue: I think it could be useful, but it doesn't seem high priority for anyone. We're happy to review it if you put something together.
Sean Scott: I was surprised when my search failed when attempting to filter on a repeated column.
I know its been a few years since there has been activity here. What is everyone doing for pushdown predicates to avoid huge data transfers/ row scans?
Claire McGinty / @clairemcginty: Current status of this ticket:
-
https://github.com/apache/parquet-java/pull/1328 (support for Contains(eq)) has been merged if anyone would like to test it
-
https://github.com/apache/parquet-java/pull/1370 (support for Contains(lt), Contains(gt), Contains(in), etc) is open
-
Todo: Support not(contains(...))
-
Todo: Support size()
I can finish up PR #1370 but probably won't have bandwidth to work on the remaining Todos until July.
contains() and not(contains()) have been added, the only operation missing now is size().
Probably we want to be able to express eq/lt/gt operations here, i.e. size(arrayColumn("foo"), lt(5)).
I think the bulk of the implementation logic will be at the record-level (IncrementallyUpdatedFilterPredicate). DictionaryFilter/StatisticsFilter/RowGroupFilter can probably do some very basic filtering (i.e. if user wants to filter to size(col, eq(0)), we can pass/fail the row groups if there are any values present), but not much more.
@clairemcginty I'm not sure whether the size filter could leverage SizeStatistics which provides the histograms of def & rep levels. (cc @emkornfield)
(i.e. if user wants to filter to size(col, eq(0)), we can pass/fail the row groups if there are any values present),
I don't think this is accurate. Because you can have lists that contain only null elements (e.g. [null, null]) so we couldn't say for sure no values present means that lists must be of size zero, and if there is a value present we can't say for sure there are lists with no values. As Gang noted you can make these determinations with the repetition/definition levels.
I think we make the following conclusions making the repetition and definitional levels (assuming 3 level list encoding).
- If all repetition levels are zero and all definitions levels are 0, then all lists are null (note: parquet i believe tends to conflate empty and null lists, so it would be good to clarify intended semantics)
- If all repetition levels are zero and all definitions levels are 1, then all lists are of size zero.
- If all repetition levels are zero and all definitions level are > MAX_DEFINITION_LEVEL - 1 , all lists are of size 1.
- If count of MAX_REPETITION_LEVEL < X then all all lists are have size <= X - 1 (in practice this might not really filter out too much).
This logic could be extended for intermediate lists.
@emkornfield, those rules makes sense. I guess that logic could be checked in ColumnIndexFilter where we have access to repetition/definition-level histograms?
It's an interesting point about null values in lists vs empty lists -- in my implementation of the Contains predicate I disallowed contains(null) due to difficulty in disambiguating the two. My feeling is that [null, null] is a list of size 2, since null is a valid value for an optional type--however, this would be difficult to implement properly, since I think that the record-level filter (IncrementallyUpdatedFilterPredicate) only invokes ValueInspector#update on non-null items.
Nice work on this @clairemcginty, looking forward to using it to filter geoparquet geoarrow geometries by bounding box.