horaedb

More advanced techniques to read parquet files efficiently

Open jiacai2050 opened this issue 3 years ago • 0 comments

Describe This Problem

Usually the IO part of a query is the most time-consuming, so reducing the time spent on it would improve query latency quite a lot.

In the current implementation, we already apply some tricks to optimize this, to name a few:

  1. concurrent reads even for one file
  2. min/max prune
  3. custom bloom filter prune
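
For readers unfamiliar with min/max pruning, here is a minimal sketch of the idea in plain Python. It is not the actual horaedb code: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the statistics are assumed to be already loaded from the file footer. A row group is skipped when its value range cannot overlap the predicate range.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for a single column."""
    index: int
    min_value: int
    max_value: int


def prune_row_groups(stats, lower, upper):
    """Return indices of row groups that may hold values in [lower, upper]."""
    return [
        s.index
        for s in stats
        # Keep the group unless its range lies entirely outside the predicate.
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    RowGroupStats(0, 0, 99),
    RowGroupStats(1, 100, 199),
    RowGroupStats(2, 200, 299),
]

# Predicate: 150 <= value <= 220. Row group 0 is never read.
kept = prune_row_groups(stats, 150, 220)
print(kept)
```

Pruning of this kind can only rule row groups out; the surviving ones still have to be decoded and filtered row by row.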

There is an awesome blog post written by @tustvold and @alamb introducing some more advanced techniques to further improve read speed. It is definitely a must-read for developers in the Arrow ecosystem.

Proposal

Explore the ideas introduced in "Querying Parquet with Millisecond Latency". Some notable ones are:

  • Page prune
  • Late materialization
  • Decode optimization, especially dictionary encoding
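
To make the first two ideas concrete, here is a rough sketch in plain Python of how page pruning and late materialization could fit together. It is only an illustration under simplifying assumptions, not the blog's or the parquet crate's implementation: `Page`, `select_rows`, and `materialize` are hypothetical names, pages are modeled as in-memory lists, and the predicate is a simple range on one column.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical data page of the filter column."""
    first_row: int
    min_value: int
    max_value: int
    values: list


def select_rows(filter_pages, lower, upper):
    """Page pruning: skip pages whose min/max rule them out, then
    evaluate the predicate only on the surviving pages."""
    selected = []
    pages_decoded = 0
    for page in filter_pages:
        if page.max_value < lower or page.min_value > upper:
            continue  # pruned by page-level statistics, never decoded
        pages_decoded += 1
        for offset, value in enumerate(page.values):
            if lower <= value <= upper:
                selected.append(page.first_row + offset)
    return selected, pages_decoded


def materialize(payload_pages, row_ids):
    """Late materialization: decode another column only for selected rows.
    payload_pages is a list of (first_row, values) tuples."""
    wanted = sorted(row_ids)
    out = []
    for first_row, values in payload_pages:
        end = first_row + len(values)
        hits = [r for r in wanted if first_row <= r < end]
        if not hits:
            continue  # page holds no selected row, so it is skipped
        out.extend(values[r - first_row] for r in hits)
    return out


ts_pages = [
    Page(0, 1, 3, [1, 2, 3]),
    Page(3, 4, 6, [4, 5, 6]),
    Page(6, 7, 9, [7, 8, 9]),
]
tag_pages = [
    (0, ["a", "b", "c"]),
    (3, ["d", "e", "f"]),
    (6, ["g", "h", "i"]),
]

# Predicate: 5 <= ts <= 7. The first ts page and first tag page are skipped.
rows, decoded = select_rows(ts_pages, 5, 7)
tags = materialize(tag_pages, rows)
print(rows, decoded, tags)
```

The point of the split is that the cheap filter column is decoded first, and the wider columns are touched only for rows that survive the predicate.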

Additional Context

No response

jiacai2050 avatar Jan 28 '23 07:01 jiacai2050