Add example for building an external secondary index for parquet files

Open alamb opened this issue 1 year ago • 1 comment

Note: While this PR looks very large (728 lines), around half of the content is comments / docstrings.

Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/10546

Rationale for this change

See https://github.com/apache/datafusion/issues/10546

Building and using external indexes in DataFusion is an important feature. Adding an example of how to do so will help drive the design and APIs.

What changes are included in this PR?

New Example
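
For readers who have not opened the diff, the general idea of file-level pruning with an external index can be sketched roughly as below. This is not the code in this PR and uses no DataFusion or parquet APIs: `ColumnRange`, `FileIndex`, `add_file`, and `files_to_scan` are hypothetical names, the file names are made up, and the min/max values are hard-coded where a real index would read them from Parquet footer statistics.

```rust
use std::collections::BTreeMap;

/// Min/max of one integer column in one file.
/// Hypothetical stand-in for statistics read from a Parquet footer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColumnRange {
    min: i64,
    max: i64,
}

/// A toy external index: file name -> value range of a single column.
/// (Assumption: one indexed column with integer values.)
#[derive(Debug, Default)]
struct FileIndex {
    ranges: BTreeMap<String, ColumnRange>,
}

impl FileIndex {
    /// Record the column range for a file.
    fn add_file(&mut self, name: &str, min: i64, max: i64) {
        self.ranges.insert(name.to_string(), ColumnRange { min, max });
    }

    /// Return the files that may contain rows with `lo <= value <= hi`.
    /// A file is skipped only when its [min, max] range cannot overlap
    /// the requested range, so pruning never drops matching rows.
    fn files_to_scan(&self, lo: i64, hi: i64) -> Vec<&str> {
        self.ranges
            .iter()
            .filter(|(_, r)| r.max >= lo && r.min <= hi)
            .map(|(name, _)| name.as_str())
            .collect()
    }
}

fn main() {
    let mut index = FileIndex::default();
    index.add_file("file1.parquet", 0, 99);
    index.add_file("file2.parquet", 100, 199);
    index.add_file("file3.parquet", 200, 299);

    // Predicate `value = 150`: only file2.parquet can contain a match.
    let files = index.files_to_scan(150, 150);
    println!("{:?}", files);
}
```

The sketch keeps the index conservative: a file is returned whenever its range overlaps the predicate, so false positives are possible but false negatives are not. The surviving file list is what would then be handed to the scan.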

Are these changes tested?

CI

Are there any user-facing changes?

No -- just an example

TODOs

  • [x] Propose a nicer API for extracting the statistics
  • [x] Connect pruning predicate into scan to avoid scanning files
  • [ ] File tickets / PRs to make creating ParquetExec easier
  • [ ] Try to make some PRs / documentation upstream in the parquet crate to make it easier to work with parquet statistics

alamb avatar May 16 '24 15:05 alamb

This PR is now ready for review.

alamb avatar May 22 '24 20:05 alamb

@crepererum and @NGA-TRAN -- here is a PR ready for your review that shows how to do file level pruning with statistics.

I will make an example of how to do row group level / page level pruning next.

alamb avatar May 27 '24 12:05 alamb

I am starting to review this.

NGA-TRAN avatar May 28 '24 14:05 NGA-TRAN

Thank you very much for the review, @crepererum.

alamb avatar May 31 '24 10:05 alamb