parquet-java
parquet-java copied to clipboard
[PARQUET-36] Add support for dictionaries to FilteringPrimitiveConverter
https://issues.apache.org/jira/browse/PARQUET-36
Please take a look. I have made some changes to Dictionary which need careful consideration and perhaps an alternative approach is better.
Also I made some changes to example implementation.
I am going to think about adding a few more tests.
Unfortunately I am having real problems building all project as I can get thrift 0.7 for my Mac. Does anyone know how to do this?
https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install
Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.
I’ll give it another go.
Mick
On 5 Feb 2015, at 18:36, Julien Le Dem [email protected] wrote:
https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install — Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.
oops test is wrong - fixing now
I have fixed the test I wrote which exercises the dictionary filtering for eq and neq cases. The generator produced some code for udfs. I'll look for some tests of this and add to those also.
I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229 https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me building thrift 0.7, I’m not sure if there is a workaround.
Have other people got thrift 0.7 running on Mac OS 10.9 or above?
Mick
On 5 Feb 2015, at 18:44, Michael Davies [email protected] wrote:
Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.
I’ll give it another go.
Mick
On 5 Feb 2015, at 18:36, Julien Le Dem <[email protected] mailto:[email protected]> wrote:
https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install — Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.
Is there a reason why parquet continues to depend on this old version of thrift? I think I read somewhere it was used just for testing.
Would it be easy to upgrade?
Thanks
Mick
On 7 Feb 2015, at 17:48, Michael Davies [email protected] wrote:
I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229 https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me building thrift 0.7, I’m not sure if there is a workaround.
Have other people got thrift 0.7 running on Mac OS 10.9 or above?
Mick
On 5 Feb 2015, at 18:44, Michael Davies <[email protected] mailto:[email protected]> wrote:
Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.
I’ll give it another go.
Mick
On 5 Feb 2015, at 18:36, Julien Le Dem <[email protected] mailto:[email protected]> wrote:
https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install — Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.
@MickDavies: I don't think you need to use thrift 0.7.0. I've built with 0.9.0 without problems, and newer versions (before 0.9.2 at least) should work.
Hi @MickDavies Thanks for looking into this. I made some comments above. There are two different features in there:
- allow the dictionary ids to go through when the underlying converter supports them
- make the filtering more efficient when there's a dictionary by knowing all the values before hand. I'd like someone to add in the description of PARQUET-36 a short description of how we achieve these. Does it sound good? Thanks again! Julien
Hi @julienledem
I guess you could separate these two features, I hadn't done so in my mind. The former is important from measurements I have done for some datasets using Spark. For data I am using a query over a table with around 100M rows can run 25% faster.
I guess the second one just seemed like an obvious optimization
both features are useful and they will both bring gains for different reasons. We don't need to split the JIRA in two but we should keep in mind those 2 distinct goals when reviewing. I think there are several ways the second optimization could be done and that's why I think a short spec in the JIRA would be useful.
Thanks for all the feedback, they comments were very useful.
I have responded to most of the comments in the last check-in, with:
- Change method name update to evaluateFilterForDictionaryElement to clarify intent
- Change Dictionary classes so that type information is abstract method associated with subclass, not carried as attribute
- Change FilteringPrimitiveConverter so that it supports dictionaries even if delegates do not
- Simplify implementation of SimplePrimitiveConverter by using type converter
I still have to do the following:
- Update JIRA as requested by @julienledem
- Add tests for dictionary filtering where delegate does not support dictionaries
- Ensure build works completely - try to get build going with thrift 0.9.1 working - looks positive
- Look at version conformance issues.
I've added a unit test for FilteringPrimitiveConverter. I wanted to use a mocking library and saw that mockito and eaymock referenced in poms though only mockito seems to be being used, so I used that.
Also I upgraded to junit 4.12 from 4.10, so that I could set the name in Parameterized junit annotation, which makes the test output look a bit nicer, but is not essential.
I'm going to simplify this work removing changes to example simple implementation
Is anyone free to review, I would like to get this change in or rejected. Thanks
Can this MR be reviewed please? This is a big performance blocker when reading a large parquet file with many String columns using a Filter as new String instances are created for each record read even though cardinality is low.