More advanced techniques to read parquet files efficiently
Describe This Problem
Usually the IO part of a query is the most time-consuming, so reducing the time spent on it can improve query latency quite a lot.
In the current implementation, we have already applied some tricks to optimize this, to name a few:
- concurrent reads, even within a single file
- min/max pruning
- custom bloom filter pruning
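As a rough illustration of how min/max (zone-map) pruning works, the sketch below skips any row group whose statistics show it cannot contain matching rows. `RowGroupStats` and `prune_row_groups` are hypothetical names for illustration only, not HoraeDB or Parquet APIs:

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (illustrative)."""
    min_val: int
    max_val: int

def prune_row_groups(stats: list[RowGroupStats], lo: int, hi: int) -> list[int]:
    """Return indices of row groups whose [min, max] overlaps the query range [lo, hi]."""
    return [
        i for i, s in enumerate(stats)
        if s.max_val >= lo and s.min_val <= hi
    ]

stats = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]
# Only the middle row group can possibly contain values in [120, 180],
# so the other two are never read from disk.
print(prune_row_groups(stats, 120, 180))  # → [1]
```

The same overlap test drives the bloom filter path as well, except the filter answers "definitely absent / maybe present" for point lookups instead of range overlap.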
There is an awesome blog written by @tustvold and @alamb introducing some more advanced techniques to further improve read speed, which is definitely a must-read for developers in the Arrow ecosystem.
Proposal
Explore the ideas introduced in Querying Parquet with Millisecond Latency. Some notable ones are:
- Page pruning
- Late materialization
- Decode optimization, especially for dictionary-encoded columns
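To make the late materialization idea concrete, here is a minimal sketch: the predicate is evaluated on the cheap filter column first, and only the surviving rows of the other (potentially wide or expensive-to-decode) columns are materialized. The function and variable names are hypothetical, not HoraeDB APIs:

```python
def late_materialize(filter_col, payload_col, predicate):
    # Pass 1: build a row selection using only the filter column.
    selected = [i for i, v in enumerate(filter_col) if predicate(v)]
    # Pass 2: materialize ("decode") only the selected rows of the
    # payload column, skipping work on rows already rejected.
    return [(filter_col[i], payload_col[i]) for i in selected]

timestamps = [1, 2, 3, 4, 5]
payloads = ["a", "b", "c", "d", "e"]  # stands in for a wide, costly column
print(late_materialize(timestamps, payloads, lambda t: t >= 4))
# → [(4, 'a'-style pairs)]: [(4, 'd'), (5, 'e')]
```

In a real Parquet reader the second pass translates into decoding only the pages that contain selected rows, which is where page pruning and late materialization compound each other.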
Additional Context
No response