parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

Add support for repeated columns in the filter2 API

Open asfimport opened this issue 11 years ago • 20 comments

They currently are not supported. They would need their own set of operators, like contains() and size() etc.

Reporter: Alex Levenson / @isnotinvain

PRs and other links:

Note: This issue was originally created as PARQUET-34. Please see the migration documentation for further details.

asfimport avatar Jul 24 '14 06:07 asfimport

Viktor Szathmáry / @phraktle: Any plans on supporting this?

asfimport avatar Sep 22 '14 22:09 asfimport

Alex Levenson / @isnotinvain: We'd like to, contains() at least shouldn't be too hard, it may not even require any changes except removing the check that prevents it.

That said we don't have any concrete plans to do it right away – pull requests always welcome and we're happy to help!

asfimport avatar Sep 22 '14 23:09 asfimport

Flavio Pompermaier: Repeated columns (protobuf) and lists (thrift) support would be VERY useful..

asfimport avatar Jul 15 '15 09:07 asfimport

Stefano Bortoli: I agree with Flavio. +1 for supporting multi-valued columns.

asfimport avatar Jul 15 '15 09:07 asfimport

Nicola: +1

asfimport avatar Sep 10 '15 15:09 asfimport

Julien Le Dem / @julienledem: @alexlevenson did you have any plan for this?

asfimport avatar Nov 05 '15 22:11 asfimport

Flavio Pompermaier: If someone could give me some advice I could try to implement it by myself (I never looked into Parquet code...)

asfimport avatar Nov 11 '15 10:11 asfimport

Julien Le Dem / @julienledem: You can start by looking at the filter2 api contribution: commit: https://github.com/apache/parquet-mr/commit/ad32bf0fd111ab473ad1080cde11de39e3c5a67f PR: https://github.com/apache/parquet-mr/pull/4 older PR with feedback: https://github.com/Parquet/parquet-mr/pull/412

asfimport avatar Nov 17 '15 18:11 asfimport

Flavio Pompermaier: I think that introducing contains() is definitely not as simple as mentioned at the beginning of this post. It would require a huge rework of parquet-column in order to handle Collections (PARQUET-35) and filtering on row groups (see PARQUET-43).

I'll be glad to contribute but I can't do it alone without a direct support from your side (at least to agree on the changes to apport at the core code). Do you think you can spend some time to actively focus with me on this task in the next days...?

asfimport avatar Nov 23 '15 13:11 asfimport

Flavio Pompermaier: Anyone willing to push this feature..? :(

asfimport avatar Dec 02 '15 22:12 asfimport

Ryan Blue / @rdblue: [~f.pompermaier], I don't think anyone has extra cycles to spend implementing this right now, but if you are interested in building it, we'll work with you to get it reviewed and included.

I think the next step is to write up what you think needs to be done so we can look at it and help you in the right direction. It may be that the disconnect between Alex's comment about this being easy and your assessment is a different level of support.

asfimport avatar Dec 03 '15 00:12 asfimport

Flavio Pompermaier: Am I still the only one asking for this feature..?

asfimport avatar Nov 17 '16 22:11 asfimport

Ryan Blue / @rdblue: I think it could be useful, but it doesn't seem high priority for anyone. We're happy to review it if you put something together.

asfimport avatar Nov 18 '16 17:11 asfimport

Sean Scott: I was surprised when my search failed when attempting to filter on a repeated column.  

 

I know its been a few years since there has been activity here.  What is everyone doing for pushdown predicates to avoid huge data transfers/ row scans?

asfimport avatar Feb 29 '24 19:02 asfimport

Claire McGinty / @clairemcginty: Current status of this ticket:

 

asfimport avatar Jun 05 '24 12:06 asfimport

contains() and not(contains()) have been added, the only operation missing now is size().

Probably we want to be able to express eq/lt/gt operations here, i.e. size(arrayColumn("foo"), lt(5)).

I think the bulk of the implementation logic will be at the record-level (IncrementallyUpdatedFilterPredicate). DictionaryFilter/StatisticsFilter/RowGroupFilter can probably do some very basic filtering (i.e. if user wants to filter to size(col, eq(0)), we can pass/fail the row groups if there are any values present), but not much more.

clairemcginty avatar Aug 06 '24 14:08 clairemcginty

@clairemcginty I'm not sure whether the size filter could leverage SizeStatistics which provides the histograms of def & rep levels. (cc @emkornfield)

wgtmac avatar Aug 06 '24 15:08 wgtmac

(i.e. if user wants to filter to size(col, eq(0)), we can pass/fail the row groups if there are any values present),

I don't think this is accurate. Because you can have lists that contain only null elements (e.g. [null, null]) so we couldn't say for sure no values present means that lists must be of size zero, and if there is a value present we can't say for sure there are lists with no values. As Gang noted you can make these determinations with the repetition/definition levels.

I think we make the following conclusions making the repetition and definitional levels (assuming 3 level list encoding).

  1. If all repetition levels are zero and all definitions levels are 0, then all lists are null (note: parquet i believe tends to conflate empty and null lists, so it would be good to clarify intended semantics)
  2. If all repetition levels are zero and all definitions levels are 1, then all lists are of size zero.
  3. If all repetition levels are zero and all definitions level are > MAX_DEFINITION_LEVEL - 1 , all lists are of size 1.
  4. If count of MAX_REPETITION_LEVEL < X then all all lists are have size <= X - 1 (in practice this might not really filter out too much).

This logic could be extended for intermediate lists.

emkornfield avatar Aug 06 '24 18:08 emkornfield

@emkornfield, those rules makes sense. I guess that logic could be checked in ColumnIndexFilter where we have access to repetition/definition-level histograms?

It's an interesting point about null values in lists vs empty lists -- in my implementation of the Contains predicate I disallowed contains(null) due to difficulty in disambiguating the two. My feeling is that [null, null] is a list of size 2, since null is a valid value for an optional type--however, this would be difficult to implement properly, since I think that the record-level filter (IncrementallyUpdatedFilterPredicate) only invokes ValueInspector#update on non-null items.

clairemcginty avatar Aug 06 '24 19:08 clairemcginty

Nice work on this @clairemcginty, looking forward to using it to filter geoparquet geoarrow geometries by bounding box.

msbarry avatar Aug 07 '24 00:08 msbarry