dask-examples
Add Parquet predicate pushdown filtering example
Here's an example of Parquet predicate pushdown filtering, as suggested by @jrbourbeau here.
I am a Dask newbie and this is my first time working with a Jupyter notebook, so be nice! ;)
I am not super-excited about checking in Parquet files to the repo (don't like binaries in GitHub), but it's the best way to keep the example simple. Let me know if you have other suggestions.
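Here's a minimal sketch of the kind of read the notebook demonstrates (the path and the filter column are illustrative, not the actual files in this PR):

```python
import dask.dataframe as dd

# Only row groups whose statistics satisfy the predicate get read from disk.
ddf = dd.read_parquet(
    "some_lake.parquet",
    filters=[("nice_first_name", "==", "Matthew")],
)
```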
I think it might be better to write the parquet files in the example, since (IIRC) they need to be written specifically to support filtering. So perhaps use a built-in dataset like `dask.datasets.timeseries()` and then show reading it back in with a filter.
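Something roughly like this (just a sketch; the output path and the filter predicate are placeholders):

```python
import dask
import dask.dataframe as dd

# Write a built-in synthetic dataset to Parquet...
df = dask.datasets.timeseries()
df.to_parquet("timeseries.parquet")

# ...then read it back with a filter so the pushdown is shown end to end.
filtered = dd.read_parquet(
    "timeseries.parquet",
    filters=[("x", ">", 0.9)],
)
```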
@TomAugspurger - thanks for reviewing. I updated the example to build the Parquet lake from CSV files. Checking in CSV files to source control doesn't feel as wrong. The Parquet lake is written to a directory that's already gitignored.
I looked into `dask.datasets.timeseries()` and it looks like I'd be able to make that work too, but I think it'd make the explanation a bit more complicated. I'm open to that approach if you feel strongly about it.
If this pull request gets merged, I'd like to use the CSV files to provide an expanded discussion on "Select only the columns that you plan to use" in the `01-data-access` example. I've done benchmarking in the Spark world and have found that column pruning can provide huge performance boosts. I want to provide a detailed discussion and demonstrate that column pruning isn't possible for CSV lakes, but is possible for Parquet lakes.
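Roughly the contrast I'd like to show (paths and column names here are made up for illustration):

```python
import dask.dataframe as dd

# CSV is row-oriented: every byte of every row must be read and parsed,
# even if only two columns are kept afterwards.
csv_ddf = dd.read_csv("data/people/*.csv")[["first_name", "age"]]

# Parquet is columnar: Dask can read just the requested columns from disk.
parquet_ddf = dd.read_parquet("data/people.parquet", columns=["first_name", "age"])
```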
Thanks again for reviewing!
Is it right to say that the parquet files need to be written with specific arguments (like partition_cols) for this performance benefit? If so, I'd prefer to write them in the example.
@TomAugspurger - thanks for the response. This code reads in CSV files and writes out Parquet files, so the current code should satisfy your wish to write out the Parquet files within the example itself.
Parquet predicate pushdown filtering depends on the Parquet metadata statistics, not on how the Parquet files are written to disk. Here's how you can access the metadata statistics:
```python
import pyarrow.parquet as pq

# Row-group / column-chunk statistics (min, max, null count) are what
# predicate pushdown uses to skip data.
parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)
```
See here for more info.
The `partition_cols` filtering is different. That's generally what I call "partition filtering". It depends on the directory structure and how the Parquet files are organized on disk. After this PR is merged, I'll create a separate PR with a good partition filtering example.
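Roughly what I have in mind for that follow-up (the column name and paths are illustrative):

```python
import dask
import dask.dataframe as dd

ddf = dask.datasets.timeseries()  # any Dask DataFrame would do

# partition_on writes one directory per distinct value of the column,
# e.g. partitioned.parquet/name=Alice/part.0.parquet
ddf.to_parquet("partitioned.parquet", partition_on=["name"])

# A filter on the partition column prunes whole directories, so the
# skipped files are never opened at all.
alice = dd.read_parquet("partitioned.parquet", filters=[("name", "==", "Alice")])
```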
Thanks again for reviewing and let me know if any additional changes are needed! Really appreciate your feedback!