databend
databend copied to clipboard
Feature: Temporal Aggregating Index
I recently wrote a blog post about speeding up temporal aggregation queries significantly in DataFusion by using µWheel.
µWheel could potentially be used by Databend also to implement a Temporal version of Aggregating Index that pre-materializes aggregates across time.
I'd be happy to help if there is interest.
I wondered if it works with the distributed warehouse?
I wondered if it works with the distributed warehouse?
I would say that µWheel can be used in two different modes:
Stream Mode:
This mode assumes that the wheel will be incrementally be updated by a streaming system. µWheel is designed around low watermarking, meaning it is up to the user/system to advance the internal time to cause aggregates to roll up over time.
A low watermark w
indicates that all records with timestamps t
where t <= w
have been ingested. This means
a wheel will start rejecting data with timestamps below the watermark. This assumption may not be fully compatible
with non-streaming systems.
Index Mode:
However, if you are working with static read-only datasets that are time partitioned, then µWheel is ideal as an index on top of this data.
So, to answer the question. If the distributed warehouse does not adopt low watermarking, it is still possible to use µWheel in Index mode. The result from different µWheel instances can be merged together if the data is sharded.
An early prototype of a WheelManager
can be seen here: https://github.com/Max-Meldrum/wheel_manager/blob/main/examples/nyc_taxi.rs
It manages wheels over a set of parquet files. Right now, it only builds a COUNT
wheel for the table and MinMax
wheels for specified columns that can be used for temporal pruning of queries.
For instance, a MinMax
wheel on the fare_amount
column can be used to check whether the following query can be skipped:
SELECT * FROM yellow_tripdata
WHERE tpep_dropoff_datetime >= '?' and < '?'
AND fare_amount > 1000
Hi, @Max-Meldrum Thanks for your sharing. It looks interesting. I am studying the design and implementation of the µWheel, and doing some performance testing. We'll consider using the µWheel for aggregate index.
Hi, @Max-Meldrum Thanks for your sharing. It looks interesting. I am studying the design and implementation of the µWheel, and doing some performance testing. We'll consider using the µWheel for aggregate index.
Great! I'm available on the Databend slack if anything is unclear or if you'd like a copy of the µWheel research paper.
有没有µWheel的架构原理图和说明@Max-Meldrum
有没有µWheel的架构原理图和说明@Max-Meldrum
https://dl.acm.org/doi/10.1145/3629104.3666031 for paper
https://github.com/uwheel/uwheel/blob/main/DESIGN.md for repo design