Create continuous aggregate from multiple tables
Hello everybody,
I have a question regarding time_bucket and continuous aggregation for multiple tables. I want to use time_bucket() to match timestamps across different time series, exactly like in this video at minute 38:51: https://youtu.be/lkqmfsWx6GM?t=2331
How should I create a query for a continuous aggregate that returns metric_1 of table_1 and metric_2 of table_2 in one continuous aggregate view? In our case the tables can have different numbers of rows: some will match, and some rows will contain NULL values when "joined". I am therefore planning to use time_bucket_gapfill() when I query the continuous aggregate view later (a sketch of that read query is included below). Is there an example of how to create a continuous aggregate over multiple tables? In our case the "join" has to be on time and a device_id (vin).
As a multi-step solution I tried creating a continuous aggregate for every table. ca_one and ca_two are both continuous aggregates with 2-second time buckets.
The following query succeeds and returns exactly what I need:
SELECT
ca_one.time_2s,
ca_one.vin,
column_first_table,
column_second_table
FROM
ca_one
LEFT JOIN ca_two ON ca_one.time_2s = ca_two.time_2s
AND ca_one.vin = ca_two.vin
GROUP BY
ca_one.time_2s,
ca_one.vin,
column_first_table,
column_second_table;
But when I try to use it in a continuous aggregate I get the error:
ERROR: only 1 hypertable is permitted in SELECT query for continuous aggregate
I don't know how to get my combined results in a time-efficient way. Does someone have a solution for this kind of (I think fairly general) use case? Perhaps I am missing something obvious.
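For reference, the read query I have in mind on top of the combined data would look roughly like this (just a sketch; combined_view stands in for however the joined result ends up being exposed, and the gapfill/locf choices are only an example):
SELECT
  time_bucket_gapfill('2 seconds', time_2s) AS bucket,
  vin,
  locf(avg(column_first_table)) AS column_first_table,
  locf(avg(column_second_table)) AS column_second_table
FROM combined_view
WHERE time_2s >= now() - interval '1 hour' AND time_2s < now()
GROUP BY bucket, vin;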
Regards Julius
You would create two continuous aggregate views and join the views when querying. Continuous aggregates can only be defined on a single hypertable.
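For illustration, the two per-table aggregates could look roughly like this (only a sketch; the time column name and the avg() aggregates are assumptions based on the question):
CREATE MATERIALIZED VIEW ca_one
WITH (timescaledb.continuous) AS
SELECT time_bucket('2 seconds', time) AS time_2s,
       vin,
       avg(metric_1) AS column_first_table
FROM table_1
GROUP BY time_2s, vin;

CREATE MATERIALIZED VIEW ca_two
WITH (timescaledb.continuous) AS
SELECT time_bucket('2 seconds', time) AS time_2s,
       vin,
       avg(metric_2) AS column_second_table
FROM table_2
GROUP BY time_2s, vin;
At query time you then join ca_one and ca_two on time_2s and vin, as in the query above.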
Yes, this is of course bad for performance, because it will join a lot of data in a live query.
A different approach would be to use wider tables with many columns (roughly as sketched below). This can have an impact on performance too, and there will be a lot of NULL values in the table. (I read that having NULL values is not too bad, but it is not great either.)
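A rough sketch of that wide-table variant (table and column names are made up): a single hypertable holds both metrics, with NULLs where a metric is absent, so one continuous aggregate covers everything.
CREATE TABLE telemetry (
    time     timestamptz NOT NULL,
    vin      text NOT NULL,
    metric_1 double precision,
    metric_2 double precision
);
SELECT create_hypertable('telemetry', 'time');

CREATE MATERIALIZED VIEW telemetry_2s
WITH (timescaledb.continuous) AS
SELECT time_bucket('2 seconds', time) AS time_2s,
       vin,
       avg(metric_1) AS metric_1,
       avg(metric_2) AS metric_2
FROM telemetry
GROUP BY time_2s, vin;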
Perhaps someone has a solution for this general problem?
This is a known limitation. @bboule
Is it supported to join "regular" tables when defining a continuous aggregate? I'm currently experiencing the same error message in that case.
@swiffer It is not supported for regular tables either. The FROM clause can contain just a single hypertable.
This will be a really useful feature to have.
Is there any news on this feature?
This would be extremely helpful, but I presume this is more of a limitation of PostgreSQL in general rather than a Timescale-specific issue. It would be a game changer though, definitely.
This is really a big limitation: not even a regular table can be joined in a continuous aggregate. We would need this to pre-aggregate clustered data, with (hypertable time-series) rows belonging to multiple clusters. Something like:
CREATE MATERIALIZED VIEW cluster_data_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('P1D', md.local_time) AS local_date,
c.cluster_definition_id as cluster_definition_id,
MIN(md.value) min,
AVG(md.value) avg,
MAX(md.value) max
FROM measurement_data md
JOIN clusters c ON c.location_id = md.location_id
GROUP BY local_date,
c.cluster_definition_id;
I even find the error message ERROR: only one hypertable allowed in continuous aggregate view
quite misleading, as it leaves open whether other regular tables are allowed or not. Only the examples here (https://github.com/timescale/timescaledb/blob/master/tsl/test/expected/continuous_aggs_errors.out) make the limitation obvious.
There should definitely be a section about (all) those limitations in the docs. I couldn't find any information regarding this: https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/ Maybe I missed something?
There should definitely be a section about (all) those limitations in the docs. I couldn't find any information regarding this.
Just stumbled across a section in the docs that mentions the limitations quite clearly:
Note that continuous aggregates have some limitations of what types of queries they can support, described in more length below. For example, the FROM clause must provide only one hypertable, i.e., no joins, CTEs, views or subqueries are supported. The GROUP BY clause must include a time bucket on the hypertable's time column, and all aggregates must be parallelizable.
It might still be worth referencing that section from the How-To Guide, as @mineralf proposed.
Not sure if the TimescaleDB folks use votes for prioritization, but we should upvote the original comment; for some reason all the votes have ended up on this comment instead. Counting that comment's votes, this issue would rank as #8.
In any case, I've hacked up a workaround to be able to easily create a "continuous aggregate" that can reference multiple tables. Might be useful for others:
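-- Round a timestamp to the nearest multiple of round_interval (used below to align refresh windows and job runs).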
CREATE OR REPLACE FUNCTION round_timestamp(ts timestamp, round_interval interval) RETURNS timestamp IMMUTABLE LANGUAGE sql AS $$
SELECT to_timestamp(round(extract(epoch from ts) / extract(epoch from round_interval))::bigint * extract(epoch from round_interval))
$$;
CREATE OR REPLACE PROCEDURE create_custom_continuous_aggregate(
name text,
query text,
start_offset interval default '5m',
end_offset interval default '4m',
schedule interval default '1m',
retention interval default '1y',
chunk_interval interval default '24h',
schedule_round boolean default true
) LANGUAGE plpgsql AS $$
DECLARE
job_id integer;
BEGIN
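-- Materialization table: created empty from the aggregate query, then turned into a compressed hypertable.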
EXECUTE 'CREATE TABLE '||quote_ident('_' || name)||' AS WITH agg AS (SELECT null::timestamp AS ts_begin, null::timestamp AS ts_end) '||query||' WITH NO DATA';
PERFORM create_hypertable(quote_ident('_'||name), 'time', chunk_time_interval => chunk_interval);
EXECUTE 'ALTER TABLE '||quote_ident('_' || name)||' SET (timescaledb.compress)';
PERFORM add_compression_policy(quote_ident('_' || name), interval '1h');
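-- Per-aggregate refresh procedure: each run recomputes the rows between start_offset and end_offset ago, rounded to the job's schedule interval.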
EXECUTE '
CREATE OR REPLACE PROCEDURE _'||name||'_update(job_id int, config jsonb) LANGUAGE plpgsql AS $p$
#variable_conflict use_variable
declare
start_offset interval;
end_offset interval;
start_ts timestamp;
end_ts timestamp;
begin
SELECT jsonb_object_field_text(config, ''start_offset'')::interval INTO STRICT start_offset;
SELECT jsonb_object_field_text(config, ''end_offset'')::interval INTO STRICT end_offset;
SELECT round_timestamp(now()::timestamp - start_offset, (SELECT schedule_interval FROM timescaledb_information.jobs jobs WHERE jobs.job_id=job_id)) INTO STRICT start_ts;
SELECT round_timestamp(now()::timestamp - end_offset, (SELECT schedule_interval FROM timescaledb_information.jobs jobs WHERE jobs.job_id=job_id)) INTO STRICT end_ts;
DELETE FROM _'||name||' WHERE time >= start_ts AND time < end_ts;
INSERT INTO _'||name||' WITH agg AS (SELECT start_ts AS ts_begin, end_ts AS ts_end) '||query||';
end;
$p$
';
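-- Schedule the refresh procedure, optionally align its first run to the schedule, and add a retention policy on the materialization table.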
SELECT add_job(('_'||name||'_update')::regproc, schedule, config => ('{"start_offset":"'|| start_offset::text ||'", "end_offset":"'|| end_offset::text ||'"}')::jsonb) INTO job_id;
IF schedule_round THEN
PERFORM alter_job(job_id, next_start => round_timestamp(now()::timestamp, schedule));
END IF;
PERFORM add_retention_policy('_'||name, retention);
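-- Real-time view: already-materialized rows UNION ALL a live aggregate over the newest, not-yet-materialized window.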
EXECUTE '
CREATE VIEW '||name||' AS
WITH agg AS (
SELECT
round_timestamp(now()::timestamp - '''||schedule::text||'''::interval - '''||end_offset::text||'''::interval, '''||schedule::text||'''::interval) AS ts_begin,
round_timestamp(now()::timestamp, '''||schedule::text||'''::interval) AS ts_end
)
SELECT * FROM _'||name||' WHERE time < (SELECT ts_begin FROM agg)
UNION ALL '||query;
END;
$$;
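-- Tear-down helper: drops the view, its refresh job, and the materialization table.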
CREATE OR REPLACE PROCEDURE drop_custom_continuous_aggregate(name text) LANGUAGE plpgsql AS $$
BEGIN
EXECUTE 'DROP VIEW '||name;
PERFORM delete_job(job_id) FROM timescaledb_information.jobs WHERE proc_name='_'||name||'_update';
EXECUTE 'DROP TABLE _'||name||' CASCADE';
END;
$$;
Example usage:
call create_custom_continuous_aggregate('test_agg', start_offset => interval '5m', end_offset => interval '1m', schedule => '1m', query => $$
SELECT
time_bucket(interval '1m', time) AS time,
instance,
host,
sum("bin") AS "bin",
sum("bout") AS "bout",
avg("duration") as "duration"
FROM mystats
WHERE time >= (SELECT ts_begin FROM agg) AND time < (SELECT ts_end FROM agg)
GROUP BY 1, 2, 3
$$);
SELECT * FROM test_agg WHERE time >= now() - interval '10m';
@phemmer I'm getting this error when calling create_custom_continuous_aggregate:
psql:/sandbox/dev/devint/cip-mrts/cms/campaigns/sql/continuous_aggregate_broker_log.sql:20: NOTICE: adding not-null constraint to column "time" DETAIL: Time dimensions cannot have NULL values. CALL
Have you come across this error yourself?
@glep207
@phemmer I'm getting this error when calling create_custom_continuous_aggregate:
psql:/sandbox/dev/devint/cip-mrts/cms/campaigns/sql/continuous_aggregate_broker_log.sql:20: NOTICE: adding not-null constraint to column "time" DETAIL: Time dimensions cannot have NULL values. CALL
Have you come across this error yourself?
I've had this in standard CAGGs when there is no unique index that includes the "time" column, or when your time column allows NULLs. I can't remember specifically, but it was something along those lines...
Does this happen when you create a normal CAGG (without the join)?
I'd be looking at columns in your CAGG which do not have NOT NULL constraints on them (in both your hypertable and your join tables).
@phemmer
Are there any known constraints/issues of your custom implementation? Does it support real-time behaviour?
Actually my "error" was not an error, just an info message. The aggregate was created successfully.
@phemmer why create the continuous aggregate as a hypertable? Wouldn't a plain table be sufficient in most cases for a continuous aggregate?
Are there any known constraints/issues of your custom implementation?
Yes, lots. The solution makes a lot of assumptions, such as that your time column is called "time", or that it's a timestamp and not an integer. The table names where it stores aggregated data have a hard-coded prefix. Etc.
The solution is meant to "just work" for common use cases, but it should be tweaked if it doesn't suit yours.
Does it support real-time behaviour?
Yes
I'm getting this error when calling create_custom_continuous_aggregate:
As you discovered, that message is a warning, and is the same message you get when creating a normal hypertable without the time column already indexed.
why create the continuous aggregate as a hypertable? A plain table would be sufficient in most cases for a continuous aggregate?
Because there's almost no reason not to, and many reasons to do so. Normal continuous aggregates are also stored in a hypertable, so this isn't unique to the custom solution.
I would debate whether a normal table really is sufficient for most cases. Normal tables are inefficient for time-series queries, especially when they store large amounts of data. And just because it's an aggregate doesn't mean it's going to be small.
Though you are more than welcome to tweak the solution to suit your needs/desires.
Has there been any update or plans for this? It seems like it would be super useful.
Has there been any update or plans for this? It seems like it would be super useful.
Yes. If you're looking to join a hypertable with a regular table, which is what this issue tracks, that has recently been implemented in #4874. If you're looking to join multiple hypertables, that is still unimplemented, but it is tracked in #3314.
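For anyone landing here later, here's a rough sketch of what that can look like once you're on a release that includes #4874 (names are made up; the general shape is one hypertable joined to one regular table on an equality condition):
CREATE MATERIALIZED VIEW conditions_by_region_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', c.time) AS day,
       l.region,
       avg(c.temperature) AS avg_temperature
FROM conditions c
JOIN locations l ON l.device_id = c.device_id
GROUP BY day, l.region;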
@phemmer just to clarify: my original plan was always to join two hypertables in a continuous aggregate, and this is what most people in this issue would like to do too.
Hmm. I've just hit this limitation. Curiously, my case is slightly different - I'm attempting a lateral join of the same hypertable simply to unnest an array, then aggregate that array to combine all within a bucket and additional grouping. However, I hit that same problem;
'ERROR: only one hypertable allowed in continuous aggregate view'
Well, there kinda is only one hypertable used in my CAGG ;) The query works fine outside of the CAGG, using the time_bucket() procedure.
Hmm. I've just hit this limitation. Curiously, my case is slightly different - I'm attempting a lateral join of the same hypertable simply to unnest an array, then aggregate that array to combine all within a bucket and additional grouping. However, I hit that same problem;
'ERROR: only one hypertable allowed in continuous aggregate view'
Well, there kinda is only one hypertable used in my CAGG ;) The query works fine outside of the CAGG, using the time_bucket() procedure.
Same here, I have a jsonb column with variable key/value contents. I want to make use of jsonb_each_text in order to create a CAGG, but for that I need a lateral join.
CREATE MATERIALIZED VIEW test_view
WITH (timescaledb.continuous)
AS
SELECT time_bucket('1 h'::interval, t."time") AS bucket, json_data.key, avg(json_data.value::int4)
FROM test_table t,
     jsonb_each_text(t.col_that_has_jsonb) AS json_data
GROUP BY 1, 2
This would be incredibly useful for me. +1