timescaledb-toolkit interpolated_duration_in: No duration when signal starts and ends outside current bucket

interpolated_duration_in: No duration when signal starts and ends outside current bucket

Open arglgruml opened this issue 2 years ago • 1 comments

Relevant system information:

PostgreSQL version (output of SELECT version();): 14.4
TimescaleDB Toolkit version (output of \dx timescaledb_toolkit in psql): 1.8.0
Installation method: docker

Describe the bug The function omits a result for a bucket if the signal starts and ends outside the bucket.

To Reproduce

create table sensor_data (sensor_id integer, ts timestamp, val integer);
select create_hypertable('sensor_data', 'ts');
insert into sensor_data (sensor_id, ts, val) values
(42, '2022-07-18 08:00:00', 0),
(42, '2022-07-18 09:00:00', 1),
(42, '2022-07-18 10:00:00', 0),
(42, '2022-07-20 10:00:00', 1),
(42, '2022-07-20 11:00:00', 0),
(42, '2022-07-21 11:00:00', 0)

-- Does work but misses data for 2022-07-19
SELECT 
  ts,
  toolkit_experimental.interpolated_duration_in(
    '0',
    agg,
    ts,
    '1 day',
    LAG(agg) OVER (ORDER BY ts),
    LEAD(agg) OVER (ORDER BY ts)
  ) zero_duration
  ,toolkit_experimental.interpolated_duration_in(
    '1',
    agg,
    ts,
    '1 day',
    LAG(agg) OVER (ORDER BY ts),
    LEAD(agg) OVER (ORDER BY ts)
  ) one_duration
FROM (
  SELECT
    time_bucket('1 day', ts) as ts,
    toolkit_experimental.state_agg(ts, val::text) as agg
  FROM
    sensor_data
  GROUP BY time_bucket('1 day', ts)
) s;

Expected behavior Results should be included for each bucket. If a signal starts and end outside a bucket i would expect the length of the bucket as the signal duration

Actual behavior A result for the bucket is not computed. result from above query:

"ts"	"zero_duration"	"one_duration"
"2022-07-18 00:00:00"	"15:00:00"	"01:00:00"
"2022-07-20 00:00:00"	"23:00:00"	"01:00:00"
"2022-07-21 00:00:00"	"11:00:00"	"00:00:00"

If you use time_bucket_gapfill instead of time_bucket the function produces an error:

ERROR:  when interpolating data between grouped data, all groups must contain some data
CONTEXT:  extension/src/state_aggregate.rs:365:17
SQL state: XX000

Additional context @davidkohn88 had these thoughts in the Slack thread (https://timescaledb.slack.com/archives/C4GT3N90X/p1658511102597979?thread_ts=1658320671.732799&cid=C4GT3N90X):

It sounds like we need a couple of things to support this though:

A way to get the previous and next non-null values, which lead and lag can't really do (they can look outside the period, but that's a different question and I can show you that separately).

A way to populate data for null values populated by gapfill so that we can just use the lead and lag to populate that.

Jul 25 '22 09:07 arglgruml

There are a couple of problems that we're dealing with here.

First, we only have state_agg objects for buckets where we have data. In the example above there's no data for 7/19, hence there's no object to query for that day's state.

Second, the window functions only lets us see the previous and next state_aggs. This means if we did somehow fill the gaps with some placeholder object, say by using time_bucket_gapfill, we still wouldn't have enough information to make the interpolated call if the gap was more than one day.

I see two ways we can proceed with solving this, but unfortunately none of them are trivial. I think the easiest approach would be to provide some mechanism that could perform gapfilling on the raw data (before bucketing). We can kind of implement this for this example by using time_bucket_gapfill with last. However this isn't a great solution as it won't work at for functions which need linear interpolation, and also might be problematic if the data has points at midnight. For this approach we really should implement a more functional gapfill function.

WITH gapfilled_midnight AS (
  SELECT
    ts + '1 day' as ts, val
  FROM (
    SELECT 
      time_bucket_gapfill('1 day', ts) as ts,
      locf(last(val, ts)) as val
    FROM sensor_data
    WHERE ts BETWEEN '2022-07-18'::timestamptz AND '2022-07-20'::timestamptz
    GROUP by time_bucket_gapfill('1 day', ts)
  ) s
)
SELECT 
  ...
FROM (
  SELECT
    time_bucket('1 day', ts) as ts,
    toolkit_experimental.state_agg(ts, val::text) as agg
  FROM
    sensor_data NATURAL FULL JOIN gapfilled_midnight
  GROUP BY time_bucket('1 day', ts)
) s;

A second approach would be to enhance our timevectors to be able to handle these operations. Unfortunately we're a very long ways from being able to support this and there are a lot of intermediate steps we'd have to pass through. Ultimately I could see us supporting something like the following, but I'm guessing we'd need at least three quarters of timevector development to get to this point:

SELECT
  timevector(ts, val::text)
  -> timebucket('1 day')
  -> interpolated_state_agg()
  -> fork(
    duration_in('0'),
    duration_in('1')
  ) -> unnest() AS (ts, zero_duration, one_duration)
FROM sensor_data;

Jul 27 '22 23:07 WireBaron

timescaledb-toolkit timescaledb-toolkit copied to clipboard

interpolated_duration_in: No duration when signal starts and ends outside current bucket

timescaledb-toolkit
timescaledb-toolkit copied to clipboard