parquet-java
PARQUET-128: Optimize the Parquet RecordReader implementation when: A. a filter predicate is pushed down, B. a filter predicate is pushed down on a flat schema
The current RecordReader implementation reads all the columns before applying the filter predicate and deciding whether to keep or discard the row. Instead, a RecordReader can assemble only the columns on which filters are applied (usually a few), apply the filter to decide whether to keep the row, and then either assemble or skip the remaining columns accordingly (see the sketch after the list below). Also, for applications like Spark SQL, the schema is usually flat, with no repeated or nested columns; in such cases it is better to have a light-weight, faster RecordReader. The performance improvement from this change is significant, and is larger when filtering returns few rows (which is usually the case) and there are many columns. This PR:
- Added OptFilterRecordReaderImplementation, which, for rows rejected by the filter, skips reading the columns that the filter does not apply to.
- Added a light-weight FlatSchemaRecordReaderImplementation with filtering, which only works with flat schemas (non-repeated and non-nested columns) and uses the above optimization as well. For applications like Spark SQL the schema is usually flat, so this reader results in much faster queries.
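To make the two-phase approach concrete, here is a minimal, self-contained sketch; the `ColumnValues` interface and `FilteringRecordReader` class are hypothetical illustrations of the idea, not the classes added by this PR:

```java
import java.util.List;
import java.util.function.Predicate;

/** Decodes values of a single column; skipValue() advances without decoding. */
interface ColumnValues {
  Object readValue();
  void skipValue();
}

/** Two-phase assembly: decode filter columns first, skip the rest for rejected rows. */
class FilteringRecordReader {
  private final List<ColumnValues> filterColumns;     // columns the predicate needs
  private final List<ColumnValues> remainingColumns;  // all other projected columns
  private final Predicate<Object[]> predicate;

  FilteringRecordReader(List<ColumnValues> filterColumns,
                        List<ColumnValues> remainingColumns,
                        Predicate<Object[]> predicate) {
    this.filterColumns = filterColumns;
    this.remainingColumns = remainingColumns;
    this.predicate = predicate;
  }

  /** Returns the assembled row, or null if the filter rejects it. */
  Object[] readRow() {
    // Phase 1: decode only the columns the predicate depends on.
    Object[] filterValues = new Object[filterColumns.size()];
    for (int i = 0; i < filterColumns.size(); i++) {
      filterValues[i] = filterColumns.get(i).readValue();
    }
    if (!predicate.test(filterValues)) {
      // Row rejected: skip the remaining columns instead of decoding them.
      for (ColumnValues column : remainingColumns) {
        column.skipValue();
      }
      return null;
    }
    // Phase 2: row accepted, so assemble the remaining columns as well.
    Object[] row = new Object[filterValues.length + remainingColumns.size()];
    System.arraycopy(filterValues, 0, row, 0, filterValues.length);
    for (int i = 0; i < remainingColumns.size(); i++) {
      row[filterValues.length + i] = remainingColumns.get(i).readValue();
    }
    return row;
  }
}
```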
Reopened https://github.com/apache/parquet-mr/pull/82 after rebasing onto the changed code base.
@nongli Does this look related to what you were doing in Spark SQL?
@isnotinvain Since you worked on the initial predicate pushdown, do you have an opinion?
Hi,
Sorry to dig up such an old issue. Using [email protected] with PageIndex on a file with a schema such as url (string), html (string), ordered by a unique url, I'm doing a single-row lookup by url.
I was expecting the reader to read only the content page containing the matching row, thanks to the page offsets.
But apparently it still builds the entire row for each tested url. The issue is that I have 1 page of URLs for 3k pages of content, so a huge amount of time is wasted transferring and decoding unnecessary data.
Is there a reason why this particular issue was not fixed? It seems to defeat the whole purpose of PageIndex (in my case at least).
edit:
Tried this PR for my use case and it does not seem to do the job, as it calls columnReader.consume(), which reads all the pages.
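For reference, a hedged sketch of how the lookup described above could be expressed with filter pushdown and column-index (PageIndex) filtering, assuming parquet-mr 1.11+ (where `ParquetReader.Builder` exposes `useColumnIndexFilter`) and the Avro read path; the file name, URL, and column names are illustrative placeholders:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.api.Binary;

public class SingleRowLookup {
  public static void main(String[] args) throws Exception {
    // Predicate on the "url" column; with column indexes this can prune pages
    // of the url column, but as described above the other columns of surviving
    // rows are still assembled by the record reader.
    FilterPredicate pred = FilterApi.eq(
        FilterApi.binaryColumn("url"),
        Binary.fromString("https://example.com/some-page"));

    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path("urls.parquet"))
            .withFilter(FilterCompat.get(pred))
            .useColumnIndexFilter(true)  // page-level pruning via column indexes
            .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record.get("html"));
      }
    }
  }
}
```

This only illustrates the intended read path; it does not change the behavior questioned above, where the pages of the non-filter column are still read for rows that pass the predicate.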