timescaledb
Continuous aggregate on top of another continuous aggregate
I'm trying to create a continuous aggregate on top of another continuous aggregate and get "invalid SELECT query for continuous aggregate". Is this not supported? I can't find any information on this. See the example below, where the first query works fine and the second fails. This is a very important feature for me, since it allows me to compute a weighted average.
create view readings_5min
with (timescaledb.continuous)
as
select
time_bucket('5 minutes',time) as time,
"serial",
avg(value) as mean
from readings_raw
group by time_bucket('5 minutes',time),"serial";
create view readings_hour
with (timescaledb.continuous)
as
select
time_bucket('1 hour',time) as time,
"serial",
avg(mean) as mean
from readings_5min
group by time_bucket('1 hour',time),"serial";
This is currently not supported.
However, the second view will probably perform just fine as a regular view.
e.g.:
create view readings_hour
as
select
time_bucket('1 hour',time) as time,
"serial",
avg(mean) as mean
from readings_5min
group by time_bucket('1 hour',time),"serial";
As an aside, taking the average of 12 5-minute averages may not give an accurate value for your purposes. Consider the following example:
interval | Measurements | Sum of Values | Average |
---|---|---|---|
00-04 | 100 | 200 | 2 |
05-09 | 20 | 80 | 4 |
Your average of averages will get you 3 as an answer here, whereas the true weighted average of the measurements is (200+80)/(100+20) ≈ 2.33.
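To make the discrepancy concrete, here is a quick sketch in Python using the numbers from the table above (the bucket contents are of course illustrative):

```python
# Two 5-minute buckets with very different measurement counts,
# mirroring the table above.
buckets = [
    {"count": 100, "sum": 200},  # 00-04: per-bucket avg = 2
    {"count": 20,  "sum": 80},   # 05-09: per-bucket avg = 4
]

# Naive approach: average the per-bucket averages.
avg_of_avgs = sum(b["sum"] / b["count"] for b in buckets) / len(buckets)

# Correct approach: weight by the number of measurements,
# i.e. total sum divided by total count.
true_avg = sum(b["sum"] for b in buckets) / sum(b["count"] for b in buckets)

print(avg_of_avgs)          # 3.0
print(round(true_avg, 2))   # 2.33
```

The two results diverge whenever the buckets hold different numbers of measurements, which is exactly the irregular-data case described in this thread.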
You could solve that by storing not just the average but also the count and sum, and then calculating the average manually, e.g.:
select
time_bucket('5 minutes',time) as time,
"serial",
avg(value) as mean,
count(value) as count,
sum(value) as sum
[...]
And then in the readings_hour view, you'd do the following:
create view readings_hour
as
select
time_bucket('1 hour',time) as time,
"serial",
sum(sum)/sum(count) as mean
from readings_5min
group by time_bucket('1 hour',time),"serial";
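As a sanity check on the sum/count roll-up, here is a small Python sketch with made-up readings showing that sum(sum)/sum(count) over the 5-minute partials reproduces the exact mean of the raw values, no matter how unevenly the measurements fall into buckets:

```python
import random

random.seed(0)
# Hypothetical raw readings for one hour: each minute gets an
# irregular number of measurements (1 to 5).
raw = [(minute, random.uniform(0, 10))
       for minute in range(60)
       for _ in range(random.randint(1, 5))]

# Stage 1: per-5-minute partials that store (sum, count), not just avg.
partials = {}
for minute, value in raw:
    bucket = minute // 5
    s, c = partials.get(bucket, (0.0, 0))
    partials[bucket] = (s + value, c + 1)

# Stage 2: hourly mean as sum(sum)/sum(count) over the partials.
hour_mean = (sum(s for s, _ in partials.values())
             / sum(c for _, c in partials.values()))

# It matches the mean computed directly over the raw data.
direct_mean = sum(v for _, v in raw) / len(raw)
assert abs(hour_mean - direct_mean) < 1e-9
```

This is why keeping sum and count in the finer aggregate makes the coarser one exact, whereas keeping only the average does not.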
Thanks for clarifying. That's exactly what I've done for now, but I'm afraid performance will decline as my data grows. Or will that not be the case, since the underlying chunk size is static? I'm not quite sure how it works.
Anyway, it would have to be a materialized view in the long run, since I have different retention policies for the different sampling rates, and then the performance would decline significantly. I know that cascade_to_materializations can't be set to false yet for continuous aggregates, but I'm hoping to get the option soon :)
Well, that depends on the data, of course. I'm working with utility data, where the weighted average is preferred, especially when I don't receive data regularly. Imagine peak usage is from 07:00-07:05, and I have 5 observations in that interval but 10 observations in each of the other 5-minute intervals. Then the straight average would come out too low, and the weighted one would be more accurate.
Another option is to define 2 separate continuous aggregates, both on the raw table; that does work with TimescaleDB 1.4:
create view readings_5min
with (timescaledb.continuous)
as
select
time_bucket('5 minutes',time) as time,
"serial",
avg(value) as mean
from readings_raw
group by time_bucket('5 minutes',time),"serial";
create view readings_hour
with (timescaledb.continuous)
as
select
time_bucket('1 hour',time) as time,
"serial",
avg(mean) as mean
from readings_raw
group by time_bucket('1 hour',time),"serial";
Well, that doesn't give me the weighted average. I have also tried a SELECT as below, but that is (perhaps obviously) not supported either.
create view readings_hour_test
with (timescaledb.continuous)
as
select
time_bucket('1 hour',time) as time,
devicetype,
manufacturer,
"serial",
"type",
unit,
sum(count) as count,
count(*) as count_5min_packets,
min(min) as min,
avg(mean) as mean,
max(max) as max,
first(first,time) as first,
first(first_time,time) as first_time,
last(last,time) as last,
last(last_time,time) as last_time
from
(select
time_bucket('5 minute',time) as time,
devicetype,
manufacturer,
"serial",
"type",
unit,
count(*) as count,
min(value) as min,
avg(value) as mean,
max(value) as max,
first(value,time) as first,
first(time,time) as first_time,
last(value,time) as last,
last(time,time) as last_time
from readings_raw
group by time_bucket('5 minute',time),devicetype,manufacturer,"serial","type",unit
) as readings_raw
group by time_bucket('1 hour',time),devicetype,manufacturer,"serial","type",unit
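For what it's worth, the inner/outer split above works in principle because each 5-minute row carries mergeable state: count, min, max, and the first/last values with their timestamps. A small Python sketch of how such partials combine losslessly, using hypothetical (time, value) rows:

```python
def make_partial(rows):
    """Aggregate raw (time, value) rows into one partial state."""
    rows = sorted(rows)  # order by time
    values = [v for _, v in rows]
    return {
        "count": len(rows),
        "min": min(values),
        "max": max(values),
        "first_time": rows[0][0], "first": rows[0][1],
        "last_time": rows[-1][0], "last": rows[-1][1],
    }

def merge_partials(parts):
    """Roll finer partials up to a coarser one, mirroring
    min(min), max(max), first(first, time), last(last, time)."""
    earliest = min(parts, key=lambda p: p["first_time"])
    latest = max(parts, key=lambda p: p["last_time"])
    return {
        "count": sum(p["count"] for p in parts),
        "min": min(p["min"] for p in parts),
        "max": max(p["max"] for p in parts),
        "first_time": earliest["first_time"], "first": earliest["first"],
        "last_time": latest["last_time"], "last": latest["last"],
    }

# Merging per-bucket partials gives the same answer as aggregating
# the raw rows directly.
raw = [(0, 5.0), (3, 9.0), (7, 1.0), (12, 4.0)]
by_bucket = {}
for t, v in raw:
    by_bucket.setdefault(t // 5, []).append((t, v))
partials = [make_partial(rows) for rows in by_bucket.values()]
assert merge_partials(partials) == make_partial(raw)
```

Note that avg is the one aggregate in the query above that is not mergeable on its own, which is why the thread's sum/count workaround is needed for the mean.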
Thanks for raising this issue, @tobiasdirksen!
I also have a need to define a continuous aggregate (cagg) on top of another one, but for a different use case.
In my case, I have a decent amount of data getting piped into a raw-data hypertable (typically at least 40 points per second, but it can spike to over 1k per second). Over the course of one day, I accumulate at least around 3.5 million data points.
The data need to be summarized into 1m, 1h and 1d reports, available on demand, once every second.
Because it isn't currently possible to define a cagg on top of another one, query performance against the downsampled caggs is not very good, particularly for the 1d reports.
It would be much more efficient for our use case if a lower resolution cagg could be defined on top of a higher resolution cagg to take advantage of the materialized data.
For us this means that, ideally, we want to be able to have the following roll-ups:
raw data → 1m → 1h → 1d
+1, I have exactly the same use case, but on a larger scale: up to 5 years of data. I've just started with Timescale and had simply assumed that I could create a hierarchy of caggs, until I discovered this bug.
Just for sharing, this is how I solve the issue for the time being: I create a materialized view (MV) that uses the cagg to populate its data, then I refresh the view with a cron job at midnight.
CREATE MATERIALIZED VIEW bql_country_monthly as
select date_trunc('month', (NOW() - INTERVAL '1 day')) as bucket, sum(sum) as sum, country, node, view
from bql_country_hourly
where bucket >= date_trunc('month', CURRENT_DATE) and bucket < date_trunc('day', (NOW() - INTERVAL '1 day'))
group by node, view, country
order by sum desc;
This query populates the MV for the current month. I wonder what happens next month, though. Will the data of the previous month still be available?
"raw data → 1m → 1h → 1d" we expected the same.
As we have 30 GB of new data every hour and want to keep that data for at least two years, aggregation is the only way to go. We run custom background tasks that aggregate the data over multiple intervals and remove unused data. The aggregation feature of Timescale seemed to be what we needed to get rid of our home-grown system.
At the moment, the only solution we see is exporting data from one aggregation table into another, which is more of a hack than a solution.
I too think this feature is important. For now, the only way to do this is to set up a cron job that queries the cagg and inserts into a new table. This feature is really important when the data can be huge, like data processed from logs. Some logs, such as named query logs, can grow to 1-2 GB in just an hour. Imagine aggregating that data over 30 days; it's going to be hard.
Yes, I wanted to chime in on this too. I think a "yearly" report would be super heavy to compute, but useful for some use cases. I had kind of hoped that chaining aggregates would work.
I have a similar use case and would like to build a cagg on a cagg. I insert about 10 rows a second, but with 300 columns. Since I insert quite old data at irregular intervals (say, saving data collected by experiments to the DB), I have to refresh the cagg manually and wait for that operation to finish before I proceed with the next portion of data.
I need 1-minute, 1-hour and 1-day aggregates. It takes about 4 seconds to refresh the 1-hour cagg, which is more or less acceptable, but it takes 2-3 minutes to refresh the 1-day cagg, and that is not acceptable. So I'd like to build the 1-day aggregates on top of the 1-minute cagg. For the moment I'll suspend development of this feature, hoping it becomes possible in an upcoming TimescaleDB release.
I had to pause development of my feature as well, because it involved too many manual triggers and policies. There were just too many compromises at every level.
It's interesting that the materialized data behind a cagg lives in a hypertable, which is a prerequisite for defining a cagg. However, when looking into the materialization table, its intermediate state appears to be stored as binary blobs.
As a workaround for now, is there a way to basically manually 'roll up' that intermediate state into the new aggregate by directly building the query on top of the underlying materialization hypertable?
I created a simple prototype that works like caggs on top of caggs but with two hypertables: https://gist.github.com/jonatas/3f8cc056742264898e3b437bcdc548ce
Any progress on this front?
Hey guys, do you have a plan to pick this up anytime soon?
+1. Without the ability to stack aggregating views, the DB server has to maintain a flat set of aggregating views for the different time buckets (hourly, daily, weekly, etc.). The problem is that every view repeats the work of its finer-grained neighbor: a daily view should be able to aggregate hourly chunks instead of the raw hypertable. The bigger the time bucket, the more costly the recalculation. At an insert rate of 100 rows/s, a minute bucket condenses 60 * 100 = 6,000 raw rows into 1 row, so recalculating the hourly view from minute aggregates touches just 60 rows, while getting the same result from the raw hypertable makes the PostgreSQL server digest 3600 * 100 = 360,000 rows for the last hour. For a daily view it is 24 times worse again, and so on.
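The row counts in the comment above can be spelled out in a few lines (same assumed 100 rows/s insert rate):

```python
insert_rate = 100                  # rows per second (figure from the comment)
raw_per_minute = 60 * insert_rate  # 6,000 raw rows collapse into 1 minute-agg row
raw_per_hour = 3600 * insert_rate  # 360,000 raw rows per hour
minute_rows_per_hour = 60          # rows an hourly view reads if built on minute aggs

# Reading 60 pre-aggregated rows instead of 360,000 raw ones is the
# reduction a stacked (hierarchical) cagg would buy for the hourly roll-up;
# a daily roll-up built on raw data is 24x worse again.
reduction = raw_per_hour // minute_rows_per_hour
print(reduction)  # 6000
```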
+1
+1 from me too; I kind of expected this to already work :/ My base table gets data at roughly a 0.1 s interval, on top of that I have a 1 s continuous aggregate, and on top of that I need a 1 m one.
This is a very important feature for us. The daily aggregation takes 6 hours to calculate because we are computing it from raw data; we have an hourly aggregation that we can't build on. We really need this feature for huge data sets. If it doesn't come soon, we'll have to consider another approach.
Progress! 👀
https://github.com/timescale/timescaledb/pull/4668
Thanks @fabriziomello for creating the PR. I am very sure that quite a few of us will be very happy once this can be merged in!