
Continuous aggregate on top of another continuous aggregate

Open tobiasdirksen opened this issue 5 years ago • 23 comments

I'm trying to create a continuous aggregate on top of another continuous aggregate and get `invalid SELECT query for continuous aggregate`. Is this not supported? I can't find any information on this. See the example below, where the first query works fine and the second fails. This is a very important feature, since it allows me to compute a weighted average.

create view readings_5min
 with (timescaledb.continuous)
 as
 select 
 time_bucket('5 minutes',time) as time,
 "serial",
 avg(value) as mean
 from readings_raw
 group by time_bucket('5 minutes',time),"serial"
   
 create view readings_hour
 with (timescaledb.continuous)
 as
 select 
 time_bucket('1 hour',time) as time,
 "serial",
 avg(mean) as mean
 from readings_5min
 group by time_bucket('1 hour',time),"serial"

tobiasdirksen avatar Aug 16 '19 07:08 tobiasdirksen

This is currently not supported.

However, the second view will probably perform just fine as a regular view.

e.g.:

 create view readings_hour
 as
 select 
 time_bucket('1 hour',time) as time,
 "serial",
 avg(mean) as mean
 from readings_5min
 group by time_bucket('1 hour',time),"serial"

feikesteenbergen avatar Aug 16 '19 10:08 feikesteenbergen

As an aside, taking the average of 12 5-minute averages may not give an accurate value for your purpose; consider the following example:

| interval | Measurements | Sum of Values | Average |
| -------- | ------------ | ------------- | ------- |
| 00-04    | 100          | 200           | 2       |
| 05-09    | 20           | 80            | 4       |

Averaging the two bucket averages will get you (2 + 4) / 2 = 3 as an answer here, whereas the actual average of the measurements is (200 + 80) / (100 + 20) ≈ 2.33.

You could solve that by storing not just the average, but also the count and sum, and then calculating the avg manually, e.g.:

 select
 time_bucket('5 minutes',time) as time,
 "serial",
 avg(value) as mean,
 count(value) as count,
 sum(value) as sum
 [...]

And then in the readings_hour view, you'd do the following:

 create view readings_hour
 as
 select 
 time_bucket('1 hour',time) as time,
 "serial",
 sum(sum)/sum(count) AS mean
 from readings_5min
 group by time_bucket('1 hour',time),"serial"

feikesteenbergen avatar Aug 16 '19 10:08 feikesteenbergen

Thanks for clarifying. That's exactly what I have done for now, but I'm afraid performance will decline as my data grows, or will that not be the case, since the underlying chunk size is static? Not quite sure how it works.

Anyway, it would have to be a materialized view in the longer run, since I have different retention policies for the different sampling rates, and then the performance would decline significantly. I know that cascade_to_materializations can't be set to false yet for continuous aggregates, but I'm hoping to get the option soon :)
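
For reference, a minimal sketch of the option being discussed, assuming the TimescaleDB 1.x retention API in which drop_chunks accepts a cascade_to_materializations argument (the interval is made up, the table name is taken from the example above):

 -- hypothetical: drop raw chunks older than 30 days while leaving the
 -- continuous aggregates intact; this is the option the comment above is waiting for
 SELECT drop_chunks(INTERVAL '30 days', 'readings_raw',
                    cascade_to_materializations => false);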

tobiasdirksen avatar Aug 16 '19 10:08 tobiasdirksen

Well, that of course depends on the data. I'm working with utility data, and there the weighted average would be preferred, especially if I don't get data regularly. Imagine a usage peak from 07:00-07:05, where I have 5 observations between 07:00 and 07:05 but 10 observations in each of the other 5-minute intervals. Then the straight average would come out too low, and the weighted one would be more accurate.

tobiasdirksen avatar Aug 16 '19 11:08 tobiasdirksen

Another option is to specify 2 separate continuous aggregates, that does work with timescale 1.4:

create view readings_5min
 with (timescaledb.continuous)
 as
 select 
 time_bucket('5 minutes',time) as time,
 "serial",
 avg(value) as mean
 from readings_raw
 group by time_bucket('5 minutes',time),"serial"
   
 create view readings_hour
 with (timescaledb.continuous)
 as
 select 
 time_bucket('1 hour',time) as time,
 "serial",
 avg(mean) as mean
 from readings_raw
 group by time_bucket('1 hour',time),"serial"

feikesteenbergen avatar Aug 16 '19 13:08 feikesteenbergen

Well, that doesn't give me the weighted average. I have also tried the select below, but that is (maybe obviously) not supported either.

create view readings_hour_test
  with (timescaledb.continuous)
  as 
  select 
  time_bucket('1 hour',time) as time,
  devicetype,
  manufacturer,
  "serial",
  "type",
  unit,
  sum(count) as count,
  count(*) as count_5min_packets,
  min(min) as min,
  avg(mean) as mean,
  max(max) as max,
  first(first,time) as first,
  first(first_time,time) as first_time,
  last(last,time) as last,
  last(last_time,time) as last_time
  from
  (select 
  time_bucket('5 minute',time) as time,
  devicetype,
  manufacturer,
  "serial",
  "type",
  unit,
  count(*) as count,
  min(value) as min,
  avg(value) as mean,
  max(value) as max,
  first(value,time) as first,
  first(time,time) as first_time,
  last(value,time) as last,
  last(time,time) as last_time
  from readings_raw
  group by time_bucket('5 minute',time),devicetype,manufacturer,"serial","type",unit
  ) as readings_raw
  group by time_bucket('1 hour',time),devicetype,manufacturer,"serial","type",unit  

tobiasdirksen avatar Aug 20 '19 10:08 tobiasdirksen

Thanks for raising this issue, @tobiasdirksen!

I also have a need to define a continuous aggregate (cagg) on top of another one, but for a different use case.

In my case, I have a decent amount of data getting piped into a raw data hypertable (typically at least 40 points per second, but it could spike to more than 1k per second). Over the course of one day, I accumulate at least 3.5 million data points.

The data need to be summarized into 1m, 1h and 1d reports, available on demand, once every second.

Because it isn't currently possible to have a cagg defined on top of another one, query performance against the downsampled caggs is not very good, particularly for the 1d reports.

It would be much more efficient for our use case if a lower resolution cagg could be defined on top of a higher resolution cagg to take advantage of the materialized data.

For us this means that ideally, we want to be able to have the following roll-ups

raw data → 1m → 1h → 1d

dhl avatar May 20 '20 03:05 dhl

+1, I have exactly the same use case, but on a larger scale: up to 5 years of data. I've just started with Timescale and simply assumed that I could create a hierarchy of caggs, until I discovered this bug.

Kazmirchuk avatar Aug 17 '20 11:08 Kazmirchuk

Just for sharing: this is how I solve this issue for the time being. I create a materialized view (MV) which uses a cagg to populate its data, then I refresh the view using cron at midnight:

CREATE MATERIALIZED VIEW bql_country_monthly as
select  date_trunc('month', (NOW() - INTERVAL '1 day')) as bucket, sum(sum), country, node, view
from bql_country_hourly
where bucket >= date_trunc('month', CURRENT_DATE) and bucket < date_trunc('day', (NOW() - INTERVAL '1 day')) 
group by node, view, country
order by sum desc;

This query will populate the MV for the current month. I wonder what happens next month? Will the data from the previous month still be available?

abigoroth avatar Oct 28 '20 11:10 abigoroth

"raw data → 1m → 1h → 1d" we expected the same.

As we have 30 GB of new data every hour and want to keep that data for at least two years, aggregation is the only way to go. We run custom background tasks that aggregate the data over multiple intervals and remove unused data. The aggregation features of TimescaleDB seemed to be what we needed to get rid of our own hand-coded system.

At the moment, the only solution we see is exporting data from one aggregation table into another, which is more of a hack than a solution.
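
For illustration, a minimal sketch of that kind of export, assuming a hypothetical hourly aggregate metrics_hourly (with bucket, "serial", count and sum columns) being rolled up into a plain daily table metrics_daily; all names are made up:

 -- hypothetical: copy yesterday's hourly aggregates into a regular daily table,
 -- e.g. from a nightly cron job
 INSERT INTO metrics_daily (bucket, "serial", count, sum, mean)
 SELECT time_bucket('1 day', bucket) AS bucket,
        "serial",
        sum(count) AS count,
        sum(sum) AS sum,
        sum(sum) / sum(count) AS mean
 FROM metrics_hourly
 WHERE bucket >= date_trunc('day', now()) - INTERVAL '1 day'
   AND bucket <  date_trunc('day', now())
 GROUP BY time_bucket('1 day', bucket), "serial";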

KimSchneider avatar Feb 10 '21 09:02 KimSchneider

I too think this feature is important. For now, the only way to do this is to set up a cron job that queries the cagg and inserts into a new table. I think this feature is really important when the data can be huge, like data processed from logs. Some logs, like the named query log, can grow by 1-2 GB in just an hour; now imagine aggregating that data collected over 30 days.. it is going to be hard.

emanzx avatar Apr 03 '21 10:04 emanzx

Yes, I wanted to chime in on this too. I think a "yearly" report would be super heavy to compute, but useful for some use cases. I kinda hoped that chaining aggregates like this would work.

Inviz avatar Apr 21 '21 18:04 Inviz

I have a similar use case and would like to build a cagg on a cagg. I insert about 10 rows a second, but with 300 columns. Since I insert quite old data at irregular intervals (say, saving data collected by experiments to the DB), I have to refresh the cagg manually and wait for this operation to finish before I proceed with the next portion of data.

I need 1 min, 1 hour and 1 day aggregates. For me it takes about 4 seconds to refresh the 1 hour cagg, which is more or less acceptable, but it takes 2-3 minutes to refresh the 1 day cagg, and that is not acceptable. So I'd like to build the 1 day aggregates on top of the 1 min cagg. For the moment I'll suspend development of this feature, in the hope that this will be possible in one of the next TimescaleDB releases.
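
For context, the manual refresh referred to above looks roughly like this on TimescaleDB 2.x; the cagg names and the time window are made up:

 -- refresh the caggs for the window that was just inserted (hypothetical names)
 CALL refresh_continuous_aggregate('readings_1min', '2021-07-01', '2021-07-02');
 CALL refresh_continuous_aggregate('readings_1hour', '2021-07-01', '2021-07-02');
 CALL refresh_continuous_aggregate('readings_1day', '2021-07-01', '2021-07-02');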

ostapkostyk avatar Jul 30 '21 10:07 ostapkostyk

I had to pause development of my feature as well, because it involved too many manual triggers and policies. There were just too many compromises at every level.

Inviz avatar Jul 30 '21 13:07 Inviz

It's interesting that the materialization table behind a cagg is itself a hypertable, which is exactly the prerequisite for defining a cagg on top of it. However, when looking into the materialization table, it appears that its intermediate state is stored as binary blobs.

As a workaround for now, is there a way to basically manually 'roll up' that intermediate state into the new aggregate by directly building the query on top of the underlying materialization hypertable?

poelstra-sioux avatar Aug 13 '21 13:08 poelstra-sioux

I created a simple prototype that works like caggs on top of caggs but with two hypertables: https://gist.github.com/jonatas/3f8cc056742264898e3b437bcdc548ce

jonatas avatar Sep 14 '21 02:09 jonatas

Any progress on this front?

iskandergaba avatar Feb 19 '22 21:02 iskandergaba

Hey guys, do you have a plan to pick this up anytime soon?

dhawal1248 avatar May 05 '22 10:05 dhawal1248

+1. Without the ability to stack aggregating views, the DB server has to maintain a flat set of aggregating views for the different time buckets (hour, day, week, etc.). The problem is that every view repeats the aggregation work of its finer-grained neighbour: the daily view should be able to aggregate hourly buckets instead of the raw hypertable. The bigger the time bucket, the more costly the recalculation. At an insert rate of 100 rows/s, one minute is 60 * 100 => 6k raw rows reduced to 1 row, so recalculating the hourly view in terms of minute aggregates only touches 60 rows, whereas getting the same result from the raw hypertable means the PostgreSQL server has to digest 3600 * 100 => 360k rows for the last hour. For the daily view it is 24 times worse, and so on.

yaitskov avatar Jun 06 '22 19:06 yaitskov

+1

Dashing-Nelson avatar Jul 12 '22 14:07 Dashing-Nelson

+1 from me too, I kind of expected this to already work :/ My base table has data at roughly a 0.1 s interval, on top of that I have a 1 s continuous aggregate, and on top of that I need a 1 m one.

dragonnn avatar Jul 25 '22 08:07 dragonnn

This is a very important feature for us. The daily aggregation takes 6 hours to calculate because it works on the raw data; we have an hourly aggregation that we can't build on. We really need this feature for huge data sets, and if it doesn't come soon we'll have to consider another approach.

ricardo-s-ferreira-alb avatar Aug 09 '22 11:08 ricardo-s-ferreira-alb

Progress! 👀

https://github.com/timescale/timescaledb/pull/4668
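
For anyone finding this later: with that change (shipped as hierarchical continuous aggregates in TimescaleDB 2.9), the original example should be expressible roughly as in the sketch below, reusing a 5-minute cagg that also stores count and sum as suggested earlier in this thread; exact restrictions may vary by version:

 -- sketch, assuming TimescaleDB >= 2.9 and a readings_5min cagg with count/sum columns
 CREATE MATERIALIZED VIEW readings_hour
 WITH (timescaledb.continuous) AS
 SELECT time_bucket('1 hour', time) AS time,
        "serial",
        sum(count) AS count,
        sum(sum) AS sum,
        sum(sum) / sum(count) AS mean
 FROM readings_5min
 GROUP BY time_bucket('1 hour', time), "serial"
 WITH NO DATA;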

mfreed avatar Sep 07 '22 16:09 mfreed

Thanks @fabriziomello for creating the PR. I am very sure that quite a few of us will be very happy once this can be merged in!

ssmoss avatar Oct 02 '22 19:10 ssmoss