More advanced techniques to read parquet files efficiently
Describe This Problem
Usually the IO part of a query is the most time-consuming, so reducing the time spent on it can improve query latency quite a lot.
In the current implementation, we have already applied some tricks to optimize this, to name a few:
- concurrent reads, even within a single file
- min/max pruning
- custom bloom filter pruning
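As a rough illustration of how min/max (zone-map) pruning works, the sketch below skips any row group whose statistics show it cannot contain matching rows. `RowGroupStats` and `prune_row_groups` are hypothetical names for illustration only, not HoraeDB or Parquet APIs:

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (illustrative)."""
    min_val: int
    max_val: int

def prune_row_groups(stats: list[RowGroupStats], lo: int, hi: int) -> list[int]:
    """Return indices of row groups whose [min, max] overlaps the query range [lo, hi]."""
    return [
        i for i, s in enumerate(stats)
        if s.max_val >= lo and s.min_val <= hi
    ]

stats = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]
# Only the middle row group can possibly contain values in [120, 180],
# so the other two are never read from disk.
print(prune_row_groups(stats, 120, 180))  # → [1]
```

The same overlap test drives the bloom filter path as well, except the filter answers "definitely absent / maybe present" for point lookups instead of range overlap.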
There is an awesome blog written by @tustvold and @alamb introducing some more advanced techniques to further improve read speed, which is definitely a must-read for developers in the Arrow ecosystem.
Proposal
Explore the ideas introduced in Querying Parquet with Millisecond Latency. Some notable ones are:
- Page pruning
- Late materialization
- Decode optimization, especially for dictionary-encoded columns
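To make the late materialization idea concrete, here is a minimal sketch: the predicate is evaluated on the cheap filter column first, and only the surviving rows of the other (potentially wide or expensive-to-decode) columns are materialized. The function and variable names are hypothetical, not HoraeDB APIs:

```python
def late_materialize(filter_col, payload_col, predicate):
    # Pass 1: build a row selection using only the filter column.
    selected = [i for i, v in enumerate(filter_col) if predicate(v)]
    # Pass 2: materialize ("decode") only the selected rows of the
    # payload column, skipping work on rows already rejected.
    return [(filter_col[i], payload_col[i]) for i in selected]

timestamps = [1, 2, 3, 4, 5]
payloads = ["a", "b", "c", "d", "e"]  # stands in for a wide, costly column
print(late_materialize(timestamps, payloads, lambda t: t >= 4))
# → [(4, 'a'-style pairs)]: [(4, 'd'), (5, 'e')]
```

In a real Parquet reader the second pass translates into decoding only the pages that contain selected rows, which is where page pruning and late materialization compound each other.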
Additional Context
No response