timescaledb-toolkit
timescaledb-toolkit copied to clipboard
interpolated_duration_in: No duration when signal starts and ends outside current bucket
Relevant system information:
- PostgreSQL version (output of
SELECT version();
): 14.4 - TimescaleDB Toolkit version (output of
\dx timescaledb_toolkit
inpsql
): 1.8.0 - Installation method: docker
Describe the bug The function omits a result for a bucket if the signal starts and ends outside the bucket.
To Reproduce
create table sensor_data (sensor_id integer, ts timestamp, val integer);
select create_hypertable('sensor_data', 'ts');
insert into sensor_data (sensor_id, ts, val) values
(42, '2022-07-18 08:00:00', 0),
(42, '2022-07-18 09:00:00', 1),
(42, '2022-07-18 10:00:00', 0),
(42, '2022-07-20 10:00:00', 1),
(42, '2022-07-20 11:00:00', 0),
(42, '2022-07-21 11:00:00', 0)
-- Does work but misses data for 2022-07-19
SELECT
ts,
toolkit_experimental.interpolated_duration_in(
'0',
agg,
ts,
'1 day',
LAG(agg) OVER (ORDER BY ts),
LEAD(agg) OVER (ORDER BY ts)
) zero_duration
,toolkit_experimental.interpolated_duration_in(
'1',
agg,
ts,
'1 day',
LAG(agg) OVER (ORDER BY ts),
LEAD(agg) OVER (ORDER BY ts)
) one_duration
FROM (
SELECT
time_bucket('1 day', ts) as ts,
toolkit_experimental.state_agg(ts, val::text) as agg
FROM
sensor_data
GROUP BY time_bucket('1 day', ts)
) s;
Expected behavior Results should be included for each bucket. If a signal starts and end outside a bucket i would expect the length of the bucket as the signal duration
Actual behavior A result for the bucket is not computed. result from above query:
"ts" "zero_duration" "one_duration"
"2022-07-18 00:00:00" "15:00:00" "01:00:00"
"2022-07-20 00:00:00" "23:00:00" "01:00:00"
"2022-07-21 00:00:00" "11:00:00" "00:00:00"
If you use time_bucket_gapfill
instead of time_bucket
the function produces an error:
ERROR: when interpolating data between grouped data, all groups must contain some data
CONTEXT: extension/src/state_aggregate.rs:365:17
SQL state: XX000
Additional context @davidkohn88 had these thoughts in the Slack thread (https://timescaledb.slack.com/archives/C4GT3N90X/p1658511102597979?thread_ts=1658320671.732799&cid=C4GT3N90X):
It sounds like we need a couple of things to support this though:
- A way to get the previous and next non-null values, which lead and lag can't really do (they can look outside the period, but that's a different question and I can show you that separately).
- A way to populate data for null values populated by gapfill so that we can just use the lead and lag to populate that.
There are a couple of problems that we're dealing with here.
First, we only have state_agg objects for buckets where we have data. In the example above there's no data for 7/19, hence there's no object to query for that day's state.
Second, the window functions only lets us see the previous and next state_aggs. This means if we did somehow fill the gaps with some placeholder object, say by using time_bucket_gapfill, we still wouldn't have enough information to make the interpolated call if the gap was more than one day.
I see two ways we can proceed with solving this, but unfortunately none of them are trivial. I think the easiest approach would be to provide some mechanism that could perform gapfilling on the raw data (before bucketing). We can kind of implement this for this example by using time_bucket_gapfill
with last
. However this isn't a great solution as it won't work at for functions which need linear interpolation, and also might be problematic if the data has points at midnight. For this approach we really should implement a more functional gapfill
function.
WITH gapfilled_midnight AS (
SELECT
ts + '1 day' as ts, val
FROM (
SELECT
time_bucket_gapfill('1 day', ts) as ts,
locf(last(val, ts)) as val
FROM sensor_data
WHERE ts BETWEEN '2022-07-18'::timestamptz AND '2022-07-20'::timestamptz
GROUP by time_bucket_gapfill('1 day', ts)
) s
)
SELECT
...
FROM (
SELECT
time_bucket('1 day', ts) as ts,
toolkit_experimental.state_agg(ts, val::text) as agg
FROM
sensor_data NATURAL FULL JOIN gapfilled_midnight
GROUP BY time_bucket('1 day', ts)
) s;
A second approach would be to enhance our timevectors to be able to handle these operations. Unfortunately we're a very long ways from being able to support this and there are a lot of intermediate steps we'd have to pass through. Ultimately I could see us supporting something like the following, but I'm guessing we'd need at least three quarters of timevector development to get to this point:
SELECT
timevector(ts, val::text)
-> timebucket('1 day')
-> interpolated_state_agg()
-> fork(
duration_in('0'),
duration_in('1')
) -> unnest() AS (ts, zero_duration, one_duration)
FROM sensor_data;