horaedb
Late materialization for Parquet reading
Describe This Problem
In production we found that Parquet decoding is one of the CPU bottlenecks for queries, and late materialization
is an effective way to optimize it.
Proposal
Parquet's late materialization implementation is currently too naive, so we must be very careful when using it. As I see it, a filter that can be pushed down for late materialization should currently satisfy the following conditions:
- It should contain just a single column (only one column needs to be pulled for evaluation).
- It should be selective enough (such as `=`, `in`).
- We should sort the filters according to the encoded size of the columns in them.

However, this is too tedious for us as users of the feature; maybe we need to help improve this in parquet.
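The conditions above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Filter`, `Op`, `is_candidate`, and `plan_pushdown` are hypothetical names, not types from HoraeDB or the parquet crate, and sorting in ascending order of encoded size (cheapest column first) is an assumption, since the proposal does not state the direction.

```rust
/// Hypothetical comparison operators, for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    In,
    NotEq,
    Lt,
    Gt,
}

/// Hypothetical description of one filter in a query.
#[derive(Debug, Clone, PartialEq)]
struct Filter {
    /// Names of the columns the filter references.
    columns: Vec<String>,
    /// The operator used by the filter.
    op: Op,
    /// Total encoded size (in bytes) of the referenced columns.
    encoded_size: u64,
}

/// A filter qualifies only if it touches exactly one column and uses an
/// operator assumed to be selective (`=` or `in`).
fn is_candidate(f: &Filter) -> bool {
    f.columns.len() == 1 && matches!(f.op, Op::Eq | Op::In)
}

/// Keep the qualifying filters and order them by encoded column size.
/// Ascending order is an assumption: the cheapest column is decoded first.
fn plan_pushdown(filters: Vec<Filter>) -> Vec<Filter> {
    let mut picked: Vec<Filter> = filters.into_iter().filter(is_candidate).collect();
    picked.sort_by_key(|f| f.encoded_size);
    picked
}
```

Under these assumptions, passing in an `=` filter on a 4096-byte column, a `>` filter, an `in` filter on a 512-byte column, and a two-column filter would keep only the `in` and `=` filters, in that order.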
- [x] Implement the first version of late materialization following the design above.
- [x] Define metrics to measure the effect of late materialization.
- [x] Improve the late materialization implementation in parquet.
Additional Context
No response
Suggestion: add links to Parquet-related information.