Support for Materialised Query Rewrite in Apache Pinot using MV Tables and Calcite
Currently, Apache Pinot supports star-tree indexes for pre-aggregation, but sometimes creating star-trees for wide-column tables is not effective due to excessive disk space consumption. To address this, introduce support for Materialised Views (MV) in Pinot:
MV Creation: Allow materialised view tables to be created externally (e.g., in Hive) and ingested into Pinot as MV tables. Metadata Mapping: Maintain metadata linking MV tables to the main table, including the original query used for MV creation. Query Rewrite: Integrate Calcite-based query rewriting to automatically select the most cost-effective MV table when serving queries. Further Optimisation: Enable star-tree indexes on MV tables for frequently accessed aggregations. Since MV tables are smaller, star-trees will consume less space, improving performance. Goal: Achieve multi-level aggregation and efficient real-time analytics for wide-column tables without prohibitive storage costs.
Pros
- Significant performance boost for ad-hoc queries and complex aggregations.
- Reduced disk space usage compared to star-trees on wide-column tables.
- Flexible design leveraging existing MV creation tools like Hive.
Cons
- Users must refresh MV tables when underlying data changes.
- Best suited for offline tables rather than real-time ingestion scenarios.
I have implemented this internally in my $dayjob to support the same and advantage we get is very high and able to serve adhoc queries on tables with 100s of columns(some are high cardinality columns) and multiple 100s of billions of rows and able to meet our SLAs of 1 min without this it will be hours.
Great proposal! Supporting materialized view would be an awesome optimization.
As a side question, Pinot supports star-tree index on a subset of columns, and star-tree index itself is a partial materialized view. Do you observe excessive disk space consumption when only subset of columns are included?
@Jackie-Jiang Thank you.
Till some extent we are able to use star tree index with around 30 columns whose cardinality is around 100s. Since our requirement is adhoc query analysis we cannot make a star tree which fits most of the cases mainly when we have high cardinality columns(20000+). Where as our MVs serves 95% adhoc queries.