
Late materialization for parquet reading

Open Rachelint opened this issue 1 year ago • 1 comments

Describe This Problem

We found in production that one of the CPU bottlenecks for queries is Parquet decoding, and late materialization is an effective way to optimize it.

Proposal

Parquet's late materialization implementation is too naive for now, so we must be very careful when using it. As I see it, a filter that can be pushed down for late materialization should currently satisfy the following conditions:

  • Should contain just a single column (just need to pull one column for eval).

  • Should be selective enough (such as =, in).

  • We should sort the filters according to the encoded size of the columns they reference. However, doing this by hand is too tiring for us as users of the feature; maybe we need to help improve this in parquet.

  • [x] Implement the first version of late materialization following the design above.

  • [x] Define metrics to measure the effect of late materialization.

  • [x] Improve the late materialization implementation in parquet.
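
To make the three conditions in the proposal concrete, here is a minimal sketch in Python. It is not the horaedb or parquet implementation: the `Filter` type, the `pick_pushdown_filters` helper, the `SELECTIVE_OPS` set, and the sample column sizes are all hypothetical names invented for illustration. The smallest-first ordering is also an assumption, since the proposal only says to sort by encoded column size.

```python
from dataclasses import dataclass

# Operators treated as "selective enough". This set is an assumption
# based on the examples (=, in) given in the proposal.
SELECTIVE_OPS = {"=", "in"}


@dataclass(frozen=True)
class Filter:
    columns: tuple  # names of the columns the predicate references
    op: str         # comparison operator, e.g. "=", "in", ">"


def pick_pushdown_filters(filters, encoded_size):
    """Return filters eligible for late materialization, ordered by the
    encoded size of their column (smallest first, which is an assumption)."""
    eligible = [
        f for f in filters
        if len(f.columns) == 1 and f.op in SELECTIVE_OPS
    ]
    return sorted(eligible, key=lambda f: encoded_size[f.columns[0]])


filters = [
    Filter(("host",), "="),
    Filter(("value",), ">"),          # rejected: not selective enough
    Filter(("region", "zone"), "="),  # rejected: references two columns
    Filter(("tag",), "in"),
]
# Hypothetical encoded sizes in bytes for each column.
encoded_size = {"host": 4096, "value": 65536, "region": 512, "zone": 512, "tag": 1024}

chosen = pick_pushdown_filters(filters, encoded_size)
print([f.columns[0] for f in chosen])  # prints ['tag', 'host']
```

Evaluating the filter on the cheapest column first means the more expensive columns are decoded only for rows that survive the earlier predicates, which is the point of late materialization.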

Additional Context

No response

Rachelint avatar Oct 08 '23 08:10 Rachelint

Suggest adding links to Parquet-related information.

tanruixiang avatar Oct 11 '23 02:10 tanruixiang