parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

[PARQUET-36] Add support for dictionaries to FilteringPrimitiveConverter

Open MickDavies opened this issue 10 years ago • 15 comments

https://issues.apache.org/jira/browse/PARQUET-36

Please take a look. I have made some changes to Dictionary which need careful consideration and perhaps an alternative approach is better.

Also I made some changes to example implementation.

I am going to think about adding a few more tests.

Unfortunately I am having real problems building all project as I can get thrift 0.7 for my Mac. Does anyone know how to do this?

MickDavies avatar Feb 05 '15 16:02 MickDavies

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install

julienledem avatar Feb 05 '15 18:02 julienledem

Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.

I’ll give it another go.

Mick

On 5 Feb 2015, at 18:36, Julien Le Dem [email protected] wrote:

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install — Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.

MickDavies avatar Feb 05 '15 18:02 MickDavies

oops test is wrong - fixing now

MickDavies avatar Feb 05 '15 21:02 MickDavies

I have fixed the test I wrote which exercises the dictionary filtering for eq and neq cases. The generator produced some code for udfs. I'll look for some tests of this and add to those also.

MickDavies avatar Feb 06 '15 08:02 MickDavies

I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229 https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me building thrift 0.7, I’m not sure if there is a workaround.

Have other people got thrift 0.7 running on Mac OS 10.9 or above?

Mick

On 5 Feb 2015, at 18:44, Michael Davies [email protected] wrote:

Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.

I’ll give it another go.

Mick

On 5 Feb 2015, at 18:36, Julien Le Dem <[email protected] mailto:[email protected]> wrote:

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install — Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.

MickDavies avatar Feb 07 '15 17:02 MickDavies

Is there a reason why parquet continues to depend on this old version of thrift? I think I read somewhere it was used just for testing.

Would it be easy to upgrade?

Thanks

Mick

On 7 Feb 2015, at 17:48, Michael Davies [email protected] wrote:

I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229 https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me building thrift 0.7, I’m not sure if there is a workaround.

Have other people got thrift 0.7 running on Mac OS 10.9 or above?

Mick

On 5 Feb 2015, at 18:44, Michael Davies <[email protected] mailto:[email protected]> wrote:

Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.

I’ll give it another go.

Mick

On 5 Feb 2015, at 18:36, Julien Le Dem <[email protected] mailto:[email protected]> wrote:

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install — Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.

MickDavies avatar Feb 09 '15 08:02 MickDavies

@MickDavies: I don't think you need to use thrift 0.7.0. I've built with 0.9.0 without problems, and newer versions (before 0.9.2 at least) should work.

rdblue avatar Mar 04 '15 20:03 rdblue

Hi @MickDavies Thanks for looking into this. I made some comments above. There are two different features in there:

  • allow the dictionary ids to go through when the underlying converter supports them
  • make the filtering more efficient when there's a dictionary by knowing all the values before hand. I'd like someone to add in the description of PARQUET-36 a short description of how we achieve these. Does it sound good? Thanks again! Julien

julienledem avatar Mar 10 '15 17:03 julienledem

Hi @julienledem

I guess you could separate these two features, I hadn't done so in my mind. The former is important from measurements I have done for some datasets using Spark. For data I am using a query over a table with around 100M rows can run 25% faster.

I guess the second one just seemed like an obvious optimization

MickDavies avatar Mar 10 '15 17:03 MickDavies

both features are useful and they will both bring gains for different reasons. We don't need to split the JIRA in two but we should keep in mind those 2 distinct goals when reviewing. I think there are several ways the second optimization could be done and that's why I think a short spec in the JIRA would be useful.

julienledem avatar Mar 10 '15 17:03 julienledem

Thanks for all the feedback, they comments were very useful.

I have responded to most of the comments in the last check-in, with:

  • Change method name update to evaluateFilterForDictionaryElement to clarify intent
  • Change Dictionary classes so that type information is abstract method associated with subclass, not carried as attribute
  • Change FilteringPrimitiveConverter so that it supports dictionaries even if delegates do not
  • Simplify implementation of SimplePrimitiveConverter by using type converter

I still have to do the following:

  • Update JIRA as requested by @julienledem
  • Add tests for dictionary filtering where delegate does not support dictionaries
  • Ensure build works completely - try to get build going with thrift 0.9.1 working - looks positive
  • Look at version conformance issues.

MickDavies avatar Mar 11 '15 20:03 MickDavies

I've added a unit test for FilteringPrimitiveConverter. I wanted to use a mocking library and saw that mockito and eaymock referenced in poms though only mockito seems to be being used, so I used that.

Also I upgraded to junit 4.12 from 4.10, so that I could set the name in Parameterized junit annotation, which makes the test output look a bit nicer, but is not essential.

MickDavies avatar Mar 15 '15 14:03 MickDavies

I'm going to simplify this work removing changes to example simple implementation

MickDavies avatar Apr 02 '15 12:04 MickDavies

Is anyone free to review, I would like to get this change in or rejected. Thanks

MickDavies avatar May 22 '15 10:05 MickDavies

Can this MR be reviewed please? This is a big performance blocker when reading a large parquet file with many String columns using a Filter as new String instances are created for each record read even though cardinality is low.

mpeleshenko avatar Jul 11 '19 13:07 mpeleshenko