parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-429: Enables predicates collecting their referred columns

Open lw-lin opened this issue 8 years ago • 3 comments

This PR enables all 3 FilterCompats to collect their referred columns.

Unit tests for the changed codes are also added.

@julienledem @rdblue @isnotinvain @liancheng Could you please take a look at this? Cheers.

lw-lin avatar Jan 14 '16 07:01 lw-lin

Would you mind to provide some sample use cases for this feature?

liancheng avatar Jan 15 '16 17:01 liancheng

Sure.

Currently in the Parquet read path, only the data of columns from requested_schema ∩ file_schema are fetched from the storage, not the data from columns specified only in predicates.

This brings some problems like:

  • PARQUET-389 Filter predicates should work with missing columns
  • PARQUET-295 Filter predicate should be able to filter on columns not specified in projection pushdown
  • PARQUET-425 Fix the bug when predicate contains columns not specified in prejection, to prevent filtering out data improperly
  • SPARK-11103 Parquet filters push-down may cause exception when schema merging is turned on
  • SPARK-11343 flaky test ParquetFilterSuite.SPARK-11103: Filter applied on merged Parquet schema with new column fails

So in order to resolve those issues, we might want to add columns specified only in predicates back into the read path. But as a very first step, we might want to figure out what columns are referred by the predicates -- this is what this PR tries to solve.

The main contribution of this PR is FilterCompatColumnCollector.java, which collects the distinguished columns in a recursive way. Specifically, FilterCompatColumnCollector.java collects columns for: (1) FilterPredicateCompat (predicate api v2); (2) UnboundRecordFilterCompat (predicate api v1); and (3) NoOpFilter.

With this very preliminary feature, we can then add the missing columns specified only in predicates back into the read path in sperate PRs, and then hopefully resolve the aforementioned JIRA tickets.

Would this make sense? Thanks a lot for reviewing @liancheng

lw-lin avatar Jan 16 '16 04:01 lw-lin

@isnotinvain @rdblue opinions?

julienledem avatar Aug 15 '16 22:08 julienledem