Write a blog about parquet predicate pushdown

Open alamb opened this issue 3 years ago • 2 comments

I think it would be super valuable to write a blog post about all the work from @thinkharderdev, @Ted-Jiang, @tustvold and others to make reading from parquet in DataFusion very fast.

I have gathered a list of items on https://github.com/apache/arrow-datafusion/issues/3462 which will perhaps spark some thoughts / ideas.

alamb avatar Sep 13 '22 10:09 alamb

I made a bit of a start on collecting some data for this. In particular I created something to allow generating parquet files for use in some test benchmarks here.

The basic idea was to show the performance of a selection of relatively simple queries in datafusion-cli and compare it to some other systems like duckdb, trino, polars, spark, etc. Hopefully this would provide ample opportunity to describe the various work that has been performed over the last 9 or so months, and would ground the performance in easily understandable terms.

We could also potentially run benchmarks with various forms of pushdown disabled, to quantify the impact of those changes, or against older versions of the parquet reader, to quantify the performance impact of things like dictionary preservation.

tustvold avatar Sep 13 '22 11:09 tustvold

> Hopefully this would provide ample opportunity to describe the various work that has been performed over the last 9 or so months, and would ground the performance in easily understandable terms.

I agree -- this would be a great start

alamb avatar Sep 13 '22 19:09 alamb

We have a draft of this post ready here https://github.com/apache/arrow-site/pull/280

alamb avatar Nov 30 '22 21:11 alamb

For a variety of reasons we posted this on the InfluxData site first: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/ -- very cool stuff

alamb avatar Dec 07 '22 20:12 alamb