databend icon indicating copy to clipboard operation
databend copied to clipboard

Feature: Temporal Aggregating Index

Open Max-Meldrum opened this issue 9 months ago • 6 comments

I recently wrote a blog post about speeding up temporal aggregation queries significantly in DataFusion by using µWheel.

µWheel could potentially be used by Databend also to implement a Temporal version of Aggregating Index that pre-materializes aggregates across time.

I'd be happy to help if there is interest.

Max-Meldrum avatar May 16 '24 14:05 Max-Meldrum

I wondered if it works with the distributed warehouse?

sundy-li avatar May 22 '24 00:05 sundy-li

I wondered if it works with the distributed warehouse?

I would say that µWheel can be used in two different modes:

Stream Mode:

This mode assumes that the wheel will be incrementally be updated by a streaming system. µWheel is designed around low watermarking, meaning it is up to the user/system to advance the internal time to cause aggregates to roll up over time.

A low watermark w indicates that all records with timestamps t where t <= w have been ingested. This means a wheel will start rejecting data with timestamps below the watermark. This assumption may not be fully compatible with non-streaming systems.

Index Mode:

However, if you are working with static read-only datasets that are time partitioned, then µWheel is ideal as an index on top of this data.

So, to answer the question. If the distributed warehouse does not adopt low watermarking, it is still possible to use µWheel in Index mode. The result from different µWheel instances can be merged together if the data is sharded.

Max-Meldrum avatar May 22 '24 08:05 Max-Meldrum

An early prototype of a WheelManagercan be seen here: https://github.com/Max-Meldrum/wheel_manager/blob/main/examples/nyc_taxi.rs

It manages wheels over a set of parquet files. Right now, it only builds a COUNTwheel for the table and MinMax wheels for specified columns that can be used for temporal pruning of queries.

For instance, a MinMax wheel on the fare_amountcolumn can be used to check whether the following query can be skipped:

SELECT * FROM yellow_tripdata
WHERE tpep_dropoff_datetime >= '?' and < '?'
AND fare_amount > 1000

Max-Meldrum avatar Jun 07 '24 09:06 Max-Meldrum

Hi, @Max-Meldrum Thanks for your sharing. It looks interesting. I am studying the design and implementation of the µWheel, and doing some performance testing. We'll consider using the µWheel for aggregate index.

b41sh avatar Jun 07 '24 11:06 b41sh

Hi, @Max-Meldrum Thanks for your sharing. It looks interesting. I am studying the design and implementation of the µWheel, and doing some performance testing. We'll consider using the µWheel for aggregate index.

Great! I'm available on the Databend slack if anything is unclear or if you'd like a copy of the µWheel research paper.

Max-Meldrum avatar Jun 07 '24 12:06 Max-Meldrum

有没有µWheel的架构原理图和说明@Max-Meldrum

lixiaoyong12 avatar Oct 19 '24 11:10 lixiaoyong12

有没有µWheel的架构原理图和说明@Max-Meldrum

https://dl.acm.org/doi/10.1145/3629104.3666031 for paper

https://github.com/uwheel/uwheel/blob/main/DESIGN.md for repo design

Max-Meldrum avatar Oct 22 '24 12:10 Max-Meldrum