FilteredRecordReader skips rows it shouldn't for schemas with optional columns
When using an UnboundRecordFilter with nested AND/OR filters over OPTIONAL columns, the value the filter reads for a record can differ from that record's actual column value.
The structure of my filter predicate that results in incorrect filtering is: (x && (y || z))
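For reference, a minimal sketch of how a predicate of that shape is built with the old filter API (column names and values here are placeholders, not my actual schema):

```java
import org.apache.parquet.filter.UnboundRecordFilter;
import static org.apache.parquet.filter.AndRecordFilter.and;
import static org.apache.parquet.filter.ColumnPredicates.equalTo;
import static org.apache.parquet.filter.ColumnRecordFilter.column;
import static org.apache.parquet.filter.OrRecordFilter.or;

// (x && (y || z)) -- column names and values are placeholders
UnboundRecordFilter filter = and(
    column("x", equalTo("X")),
    or(
        column("y", equalTo(5)),
        column("z", equalTo(1))));
```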
When I step through it with a debugger, I can see that the value read from the ColumnReader inside my Predicate differs from the value actually stored for that row.
Looking deeper, there is a buffer of dictionary keys in RunLengthBitPackingHybridDecoder (I am using RLE). This array contains only two distinct keys, [0,1], whereas my optional column has three distinct values, [null,0,1]. For example, if I had a column with values 5,10,10,null,10 and dictionary keys 0 -> 5 and 1 -> 10, the buffer would hold 0,1,1,1,0, and in the case that it reads the last row, it would return 0 -> 5.
So it seems that nothing is keeping track of where nulls appear.
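To make the drift concrete, here is a toy model (plain Java, not Parquet internals) of a consumer that pulls one dictionary key per row without consulting definition levels, using the 5,10,10,null,10 example above:

```java
public class NullDriftDemo {
  public static void main(String[] args) {
    Integer[] rowValues = {5, 10, 10, null, 10}; // logical column, with a null
    int[] dictionary = {5, 10};                  // key 0 -> 5, key 1 -> 10
    int[] keyBuffer = {0, 1, 1, 1};              // one key per NON-null value only

    int keyIndex = 0;
    for (int row = 0; row < rowValues.length; row++) {
      // naive consumer: one key per row, no null tracking
      int seen = dictionary[keyBuffer[keyIndex++ % keyBuffer.length]];
      System.out.printf("row %d: actual=%s, filter sees=%d%n", row, rowValues[row], seen);
    }
    // rows 0-2 agree; the null row wrongly consumes the key meant for the
    // last row, and the last row wraps around to key 0 -> 5.
  }
}
```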
Hope someone can take a look, as it is a blocker for my project.
Environment: Linux, Java 7 / Java 8
Reporter: Steven Mellinger
Related issues:
- filter2 API performance regression (relates to)
Note: This issue was originally created as PARQUET-182. Please see the migration documentation for further details.
Steven Mellinger: Adding the csv data I used to create the Parquet file my tests run against (missing data is treated as null):

```
2014-08-01,2014-08-10,X,Mark,5,111.111,1
2014-08-02,2014-08-10,,Mark,5,222.222,2
2014-08-01,2014-08-10,Y,,5,333.333,3
2014-08-02,2014-08-10,Y,Mark,,444.444,4
2014-08-01,2014-08-20,X,Randy,5,,5
2014-08-02,2014-08-20,X,Randy,5,666.666,6
2014-08-01,2014-08-20,X,Randy,10,777.777,
2014-08-02,2014-08-20,X,Randy,10,888.888,8
```
Steven Mellinger: Any update on this issue? Picking up the latest 1.7.0 jar didn't have any impact.
Alex Levenson / @isnotinvain: Yikes – I was sort of worried that this was the case when I was building the filter2 API (I wasn't sure how the unbound record filter was managing to keep track of nulls, or repeated values for that matter). I wonder if this is also part of the cause of PARQUET-98 (filter2 API is slower than unbound record filter).
Steven Mellinger: I'm surprised no one else has encountered this; it seems like a major blocker. I would love a resolution; my current project has multiple reader implementations depending on whether or not the schema contains nullables.
-Steve Mellinger
Alex Levenson / @isnotinvain:
Hi [~stevemel]
Yes, I'm surprised nobody else has run into this, I don't know how many people are using the unbound record filter, or if they are using it on optional columns / nested schemas.
As a quick work-around, you could give the filter2 API a try. Have you seen it? It accomplishes the same goal, and it can also apply filters to metadata about chunks of data, which can be a huge win in some cases.
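A minimal sketch of the equivalent (x && (y || z)) predicate in filter2 (column names and types are placeholders):

```java
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.io.api.Binary;
import static org.apache.parquet.filter2.predicate.FilterApi.*;

// (x && (y || z)) -- column names and types are placeholders
FilterPredicate pred = and(
    eq(binaryColumn("x"), Binary.fromString("X")),
    or(
        eq(intColumn("y"), 5),
        eq(intColumn("z"), 1)));

// then pass FilterCompat.get(pred) to the reader, e.g.
// ParquetReader.builder(readSupport, path).withFilter(FilterCompat.get(pred)).build();
```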
Steven Mellinger: I'm using both filter APIs in my project. Filter2 was much slower, so I wanted to use the Filter1 API in all cases. Currently I use Filter2 when the schema has nullables and the Filter1 API for schemas that do not.