timescaledb icon indicating copy to clipboard operation
timescaledb copied to clipboard

[Enhancement]: Allow JOINs in CAGG definition

Open maxhertrampf opened this issue 3 years ago • 10 comments

What type of enhancement is this?

API improvement

What subsystems and features will be improved?

Continuous aggregate

What does the enhancement do?

Allow joins with another table in Continuous Aggregates to enable different timezone, origin, and offset parameters per time series. Currently, only a fixed timezone, origin, and offset is planned to be usable per Continous Aggregate.

Background:

We have international weather data, whose readings we store in a single table conditions. The timezone of each weather station is stored in a weather_stations table.

We want to aggregate the weather data from each station in the respective time zone. The following SQL demonstrates our approach and the error message.

CREATE TABLE weather_stations (
  id BIGINT PRIMARY KEY,
  tz TEXT
);

INSERT INTO weather_stations (id, tz) VALUES
	(1, 'Europe/Moscow'),
	(2, 'Europe/Berlin');

CREATE TABLE conditions (
  weather_station_id BIGINT,
  day timestamptz NOT NULL,
  temperature INT NOT null
);

SELECT create_hypertable(
  'conditions', 'day',
  chunk_time_interval => INTERVAL '1 day'
);

INSERT INTO conditions (weather_station_id, day, temperature) VALUES
  (1, '2021-06-14 00:00:00 Europe/Moscow', 26),
  (1, '2021-06-15 00:00:00 Europe/Moscow', 22),
  (1, '2021-06-16 00:00:00 Europe/Moscow', 24),
  (1, '2021-06-17 00:00:00 Europe/Moscow', 24),
  (1, '2021-06-18 00:00:00 Europe/Moscow', 27),
  (1, '2021-06-19 00:00:00 Europe/Moscow', 28),
  (1, '2021-06-20 00:00:00 Europe/Moscow', 30),
  (2, '2021-06-21 00:00:00 Europe/Berlin', 31),
  (2, '2021-06-22 00:00:00 Europe/Berlin', 34),
  (2, '2021-06-23 00:00:00 Europe/Berlin', 34),
  (2, '2021-06-24 00:00:00 Europe/Berlin', 34),
  (2, '2021-06-25 00:00:00 Europe/Berlin', 32),
  (2, '2021-06-26 00:00:00 Europe/Berlin', 32),
  (2, '2021-06-27 00:00:00 Europe/Berlin', 31);
  
CREATE MATERIALIZED VIEW conditions_monthly
WITH (timescaledb.continuous, timescaledb.materialized_only=true) AS
SELECT 
   timescaledb_experimental.time_bucket_ng('1 month', day, weather_stations.tz) AS bucket,
   MIN(temperature),
   MAX(temperature)
FROM conditions
JOIN weather_stations ON conditions.weather_station_id = weather_stations.id
GROUP BY bucket; 

=> SQL Error [0A000]: ERROR: only one hypertable allowed in continuous aggregate view

(Although not included in the example SQL, origin and offset can vary per time series as well.)

The error is not unexpected, as the documentation states that JOINs are not possible in Continuous Aggregates: https://docs.timescale.com/api/latest/continuous-aggregates/create_materialized_view/#create-materialized-view-continuous-aggregate

Only a single hypertable can be specified in the FROM clause of the SELECT query. You cannot include more hypertables, joins, tables, views, or subqueries.

However, this limitation seriously limits the usefulness of timezone-aware Continuous Aggregates. Being able to specify a different time zone per time series is essential for us and certainly for a lot of other international TimescaleDB users, too.

The solutions I could think of are not satisfying:

  • A) Copying the timezone, origin, and offset to each and every row in the hypertable -> that would be a lot of unnecessary duplicate data
  • B) Manage a separate Continuous Aggregates for every combination of timezone, origin, and offset -> introduces a lot of unnecessary complexity

Therefore, to implement international use cases, it would be best if a JOIN with at least one non-hypertable would be allowed.

Implementation challenges

Possibly (unsure):

  • How to do joins efficiently on distributed environments
    • Maybe very simple joins with only one table are allowed to fetch time series meta data?

maxhertrampf avatar Feb 11 '22 13:02 maxhertrampf

Hi! Thanks for the feedback and suggestion. This is already under development and we expect it to be delivered in 2.7.0

Please see https://github.com/timescale/timescaledb/pull/4045

The timezones support is expected to be delivered in 2.6.0. This is already merged to the master branch https://github.com/timescale/timescaledb/commit/ae02934c6a43524b6c17b562b283bac08a2bf4be

afiskon avatar Feb 11 '22 14:02 afiskon

This issue partially duplicates https://github.com/timescale/timescaledb/issues/414 so maybe we will close it.

afiskon avatar Feb 11 '22 14:02 afiskon

Thanks for your feedback!

This issue is certainly not a duplicate of #414. It can be viewed as an extension.

I know that a general support for timezones is coming, but as far as I know it is not possible to use a different timezone for each time series in a hypertable, as that would require JOINs with other tables. See the detailed "Background" description.

(EDIT: I updated the description above to make that clearer)

Or did I misunderstand something here?

maxhertrampf avatar Feb 11 '22 14:02 maxhertrampf

Is there any other way to do this?

If I understand your case correctly, you don't need JOINs in CAGG definitions for this. Here is an example of how this can be solved in the upcoming (probably next week) 2.6.0 release:

CREATE TABLE conditions_tz(
  day timestamptz NOT NULL,
  city text NOT NULL,
  temperature INT NOT NULL);

SELECT create_hypertable(
  'conditions_tz', 'day',
  chunk_time_interval => INTERVAL '1 day'
);

INSERT INTO conditions_tz (day, city, temperature) VALUES
  ('2021-06-14 00:00:00 Europe/Moscow', 'Moscow', 26),
  ('2021-06-15 00:00:00 Europe/Moscow', 'Moscow', 22),
  ('2021-06-16 00:00:00 Europe/Moscow', 'Moscow', 24),
  ('2021-06-17 00:00:00 Europe/Moscow', 'Moscow', 24),
  ('2021-06-18 00:00:00 Europe/Moscow', 'Moscow', 27),
  ('2021-06-19 00:00:00 Europe/Moscow', 'Moscow', 28),
  ('2021-06-20 00:00:00 Europe/Moscow', 'Moscow', 30),
  ('2021-06-21 00:00:00 Europe/Berlin', 'Berlin', 31),
  ('2021-06-22 00:00:00 Europe/Berlin', 'Berlin', 34),
  ('2021-06-23 00:00:00 Europe/Berlin', 'Berlin', 34),
  ('2021-06-24 00:00:00 Europe/Berlin', 'Berlin', 34),
  ('2021-06-25 00:00:00 Europe/Berlin', 'Berlin', 32),
  ('2021-06-26 00:00:00 Europe/Berlin', 'Berlin', 32),
  ('2021-06-27 00:00:00 Europe/Berlin', 'Berlin', 31);

CREATE MATERIALIZED VIEW conditions_moscow
WITH (timescaledb.continuous, timescaledb.materialized_only=true) AS
SELECT 
   timescaledb_experimental.time_bucket_ng('1 month', day, 'Europe/Moscow') AS bucket,
   MIN(temperature),
   MAX(temperature)
FROM conditions_tz
WHERE city = 'Moscow'
GROUP BY city, bucket;

CREATE MATERIALIZED VIEW conditions_berlin
WITH (timescaledb.continuous, timescaledb.materialized_only=true) AS
SELECT 
   timescaledb_experimental.time_bucket_ng('1 month', day, 'Europe/Berlin') AS bucket,
   MIN(temperature),
   MAX(temperature)
FROM conditions_tz
WHERE city = 'Berlin'
GROUP BY city, bucket;

=# SELECT bucket at time zone 'Europe/Moscow' as tstamp, min, max FROM conditions_moscow;
      tstamp         | min | max
---------------------+-----+-----
 2021-06-01 00:00:00 |  22 |  30
(1 row)

=# SELECT bucket at time zone 'Europe/Berlin' as tstamp, min, max FROM conditions_berlin;
      tstamp         | min | max
---------------------+-----+-----
 2021-06-01 00:00:00 |  31 |  34

If this is still not something you want could you please update the issue and explicitly request JOINs in CAGGs definition and probably add a usage example in terms of database schema, SQL queries, and expected results?

afiskon avatar Feb 12 '22 15:02 afiskon

If this is still not something you want could you please update the issue and explicitly request JOINs in CAGGs definition and probably add a usage example in terms of database schema, SQL queries, and expected results?

Unfortunately, your example is not acceptable for our use case, since managing separate continuous aggregates for each combination of tz, origin, and offset is too complex -- also when querying the data.

Thank you for changing the title. I added an SQL example and revised the Enhancement description.

maxhertrampf avatar Feb 14 '22 09:02 maxhertrampf

Sounds this may still be a duplicate, but of #1446 :-)

phemmer avatar Feb 15 '22 17:02 phemmer

It is indeed similar in regards to that it also wants to allow to use multiple tables in a Continuous Aggregate.

However, that issue requests using multiple hypertables as the source for one Continuous Aggregate.

IMO, this is different from allowing a JOIN with a non-hypertable that has time series meta information, which is necessary for timezone-aware Continuous Aggregates.

That said, as they are related, they should be linked, if that is possible.

maxhertrampf avatar Feb 16 '22 08:02 maxhertrampf

I think you're misreading the issue, if you want multiple hypertables, that's #3314.

phemmer avatar Feb 16 '22 12:02 phemmer

I'm sad to see this issue is no longer planned for Q3.

Did more important isssues get in the way or is it put on ice indefinitively?

Because, as I said, without this enhancement, multi-timezone use cases are practically not possible.

maxhertrampf avatar Jul 25 '22 20:07 maxhertrampf

Hi! Thanks for the feedback and suggestion. This is already under development and we expect it to be delivered in 2.7.0

Please see #4045

The timezones support is expected to be delivered in 2.6.0. This is already merged to the master branch ae02934

Bummed this didn't come out with 2.7.0! We are kind of in a lurch as well as we need to do a self-join of a single hypertable + join some vanilla postgres metadata information.

DavisWeaver avatar Aug 24 '22 19:08 DavisWeaver

Would seem that the priority approach is to JOIN cagg in the query but it significantly limits how far you can push these cagg setups, i.e. what's the point if I can't neither cross join, nor apply the window functions then I have no way of specialising the cagg far enough for it to be also lightweight enough so I can create them in droves not different from how you would do prepared statements!

tucnak avatar Oct 29 '22 17:10 tucnak

Thank you very much, @RafiaSabih, for implementing JOINs in Continuous Aggregates (https://github.com/timescale/timescaledb/commit/a67b90e977194f3e55c93ed6b3f5d2a671d503c1)

If I read it right, this should fix this issue as well, correct?

maxhertrampf avatar Jan 25 '23 07:01 maxhertrampf

Hi,

I am also interested in the dynamic timezone in time_bucket function inside continuous aggregates.

I was trying nightly build for 2.10.0-dev version which had joins enabled inside aggregates, but passing timezone column to time_bucket leads to errors:

CREATE MATERIALIZED VIEW values_daily
    WITH (timescaledb.continuous) AS
    SELECT
        v.point_id,
        time_bucket('1 day', v.timestamp, timezone => p.timezone) AS bucket,
        SUM(value) AS sum
    FROM values v, points p
    WHERE p.point_id = v.point_id
    GROUP BY v.point_id, bucket;
ERROR:  only immutable expressions allowed in time bucket function
HINT:  Use an immutable expression as third argument to the time bucket function.

czterocyty avatar Feb 09 '23 22:02 czterocyty

@maxhertrampf I suspect that pull request you mentioned does cover your use case nicely. I would like, however, to bump this issue still, since the feature is lacking a bit. There are very valid comments by @mkindahl and @konskov to @RafiaSabih in there that flew under the radar. Namely about the fact that left join is not supported without a reason given being part of the design doc.

Our use case that would require that: Basically each row has a M:N relation to a tag and in the aggregated view we would like for the rows to have a column with count of a specific tag among other stuff. That does not work now since without left join we are missing all the rows that have 0 tags assigned.

artu-ole avatar Apr 25 '23 13:04 artu-ole

It seems this issue has strayed into tangents but I for one would love/need(!) to see the limitations on CAGGs relaxed to allow hypertable joins in the CAGG definition or lateral joins (e.g. unnest an array) or CTEs or something that allows us to pull in fields that make sense in a group-by clause. I know that a regular table can be joined but sadly our use-case doesn't fit that pattern. Either a lateral join on an unnested array or a join to another hypertable (other than the source) as part of the definition fits our model.

jvanns avatar Jul 04 '23 08:07 jvanns

@artu-ole, I'm having the same issue as @maxhertrampf. The pull request doesn't solve the issue of dynamic timezones because as @czterocyty said, the time_bucket function doesn't allow for a mutable timezone

mummyhen avatar Jul 04 '23 09:07 mummyhen