timescaledb-toolkit
timescaledb-toolkit copied to clipboard
Missing values / Gap query
Is your feature request related to a problem? Please describe.
In the renewable energy area, we want to be conservative on missing values. Meaning, if a turbine didn't send any production for a certain period of time, we want to keep track of it.
This can lead to either missing indexes in the db, or indexes for which a column is set to Null or NaN.
It would be of interest to have a query to retrieve those gaps.
The completeness of the timeseries might then be deduced, with the definition being the following ratio:
(nrecords with value) / (nrecords with value + nrecords missing)
A wind turbine production with a 0.95 completness means 5% of the time it was in maintenance / stopped for any other reason.
Describe the solution you'd like
-- Sample data: mind the gaps in the timestamps
SELECT * FROM tsdata.power
+------------------------+------------+
| ts | power_kw |
|------------------------+------------|
| 2019-12-31 23:00:00+00 | 743.0 |
| 2019-12-31 23:02:00+00 | 712.0 |
| 2019-12-31 23:03:00+00 | 696.0 |
| 2019-12-31 23:06:00+00 | 769.0 |
| 2019-12-31 23:07:00+00 | 840.0 |
| 2019-12-31 23:08:00+00 | 832.0 |
| 2019-12-31 23:09:00+00 | 850.0 |
| 2019-12-31 23:15:00+00 | 892.0 |
+------------------------+------------+
-- Select gaps that last 5 minutes
SELECT time_bucket_gaps('5 minute', ts, power_kw) FROM tsdata.power
+------------------------+
| time_bucket_gaps |
|------------------------+
| 2019-12-31 23:10:00+00 |
+------------------------+
-- Select gaps that last 1 minutes
SELECT time_bucket_gaps('1 minute', ts, power_kw) FROM tsdata.power
+------------------------+
| time_bucket_gaps |
|------------------------+
| 2019-12-31 23:01:00+00 |
| 2019-12-31 23:04:00+00 |
| 2019-12-31 23:05:00+00 |
| 2019-12-31 23:10:00+00 |
| 2019-12-31 23:11:00+00 |
| 2019-12-31 23:12:00+00 |
| 2019-12-31 23:13:00+00 |
| 2019-12-31 23:14:00+00 |
+------------------------+
-- Select gaps that last 2 minutes
SELECT time_bucket_gaps('2 minute', ts, power_kw) FROM tsdata.power
+------------------------+
| time_bucket_gaps |
|------------------------+
| 2019-12-31 23:04:00+00 |
| 2019-12-31 23:10:00+00 |
| 2019-12-31 23:12:00+00 |
+------------------------+
Additional context
Often, the wind turbine captor is sending a value only when it measures a different value compared to previously, with a minimum interval of one minute.
This lead to unevenly-spaced timeseries:
power_kW
2020-01-01 00:00:00+01:00 743.0
2020-01-01 00:02:00+01:00 712.0
2020-01-01 00:03:00+01:00 696.0
2020-01-01 00:06:00+01:00 769.0
2020-01-01 00:07:00+01:00 840.0
2020-01-01 00:08:00+01:00 832.0
2020-01-01 00:09:00+01:00 850.0
2020-01-01 00:15:00+01:00 892.0
Or viewing it with NaNs:
power_kW
2020-01-01 00:00:00+01:00 743.0
2020-01-01 00:01:00+01:00 NaN
2020-01-01 00:02:00+01:00 712.0
2020-01-01 00:03:00+01:00 696.0
2020-01-01 00:04:00+01:00 NaN
2020-01-01 00:05:00+01:00 NaN
2020-01-01 00:06:00+01:00 769.0
2020-01-01 00:07:00+01:00 840.0
2020-01-01 00:08:00+01:00 832.0
2020-01-01 00:09:00+01:00 850.0
2020-01-01 00:10:00+01:00 NaN
2020-01-01 00:11:00+01:00 NaN
2020-01-01 00:12:00+01:00 NaN
2020-01-01 00:13:00+01:00 NaN
2020-01-01 00:14:00+01:00 NaN
2020-01-01 00:15:00+01:00 892.0
Hence the sample data used previously.
Hi! Have you looked at https://docs.timescale.com/latest/api#time_bucket_gapfill?
I think time_bucket_gapfill in a subquery followed by a count(*) filter (where power is null) + count(*) should give you the ratios you want?
Though I suppose that wouldn't work with continuous aggregates, which may be why you're asking more specifically about it.
I was looking for something like this as well. Can be solved by this query:
select date(b), count(*) filter (where a.count is null) from
(select
time_bucket_gapfill('1m', bucket) as b, count(*) as count
from series_aggregegated_by_1m
where bucket between '2024-06-01 00:00:00' and now() and sensor_type = ... group by b)
a group by date(b);
This one groups by day, but you can adjust it as you want.