tsibble
index support for {lubridate} `start - end` interval
Brief description of the problem
The `*_gaps()` functions are a very useful set of commands, and I would love to be able to use them on irregular longitudinal data with `start` and `stop` index columns, like `tsibbledata::nyc_bikes`.
Currently, irregular-interval tsibbles are not supported by the `*_gaps()` function family.
What output is expected
This is a manually edited mock-up, not real output:
nyc_bikes_dual_index <- build_tsibble(tsibbledata::nyc_bikes, key = bike_id, index=start_time, index2=stop_time)
scan_gaps(nyc_bikes_dual_index) %>% head(3)
# A tsibble: 4,258 x 12 [0.0149047991726547µs] <America/New_York>
# Key: bike_id [10]
bike_id start_time stop_time
<fct> <dttm> <dttm>
1 26301 2018-02-26 19:15:40 2018-02-27 07:52:49
2 26301 2018-02-27 07:58:13 2018-02-27 12:03:27
3 26301 2018-02-27 12:04:54 2018-02-27 13:53:51
`*_gaps()` don't know how to handle irregular temporal data. {tsibble} doesn't support a dual index; `index2` means temporary grouping. Can you please elaborate on what outcome is expected for an irregular tsibble?
Hello @earowang. Sure, `*_gaps()` don't know how to handle irregular temporal data, and that is the reason for my feature request here. Sorry for my mistake in using `index2` as a secondary index in the example.
Irregular time series with start-stop times or durations are very common in process control, industrial robot monitoring, ... so detecting gaps can be crucial on those datasets. As tsibble is a fantastic framework, I would love to see it extended to irregular start-stop time series, since that generalises the tsibble interval used in the calculation of gaps.
The expected outcome is, as in the provided example with `nyc_bikes`, all start-stop intervals where data is missing. This allows computing usage ratios, evaluating efficiency, detecting communication loss, ... `nyc_bikes` is only a toy example here, as start-stop data usually have contiguous time intervals.
The concept of tsibble's interval is different from the {lubridate} start-end `Interval`. tsibble's interval is more about the time differences between time indices, whereas lubridate's `Interval` defines specific start-end timestamps. A regular time series means a constant difference is assumed across all time indices, and `*_gaps()` therefore rely on that assumption.
Using `nyc_bikes` as an example, the time index can be represented in lubridate's `Interval` class. I'm not sure what the time difference is here for `x`. Should it be 1 second by aligning `start` or `end` for these two observations? Or should it be 1 second from the difference between `start` and `end`? In the field you work on, what's the common practice?
library(lubridate)
obs1 <- interval(ymd_hms("2018-02-26 19:15:40"), ymd_hms("2018-02-27 07:52:49"))
obs2 <- interval(ymd_hms("2018-02-27 07:58:13"), ymd_hms("2018-02-27 12:03:27"))
x <- c(obs1, obs2)
x
#> [1] 2018-02-26 19:15:40 UTC--2018-02-27 07:52:49 UTC
#> [2] 2018-02-27 07:58:13 UTC--2018-02-27 12:03:27 UTC
Created on 2021-09-29 by the reprex package (v2.0.1)
I fully agree that the term "interval" is misleading here, so let's call it `min_gap`: the minimum time between the end time of event n and the start time of event n+1 for it to be considered a gap.
In the field I work in, the alignment between a start time and the next end time depends on the situation, and is usually tied to the precision of the source timestamps. (This is not a very useful statement, I know.) We can only assume that a gap cannot be raised for anything shorter than one unit of the last digit of the time precision, i.e. `min_gap` is no less than one second for `ymd_hms()` timestamps.
The two main use cases are:
- a single agent/machine sequence of actions (like a single log file), where there is an implicit `min_gap` of 1 second. We could think of a parameter to force that `min_gap` higher when the times are known to result from a rounding operation, like "closest 10-second step" on slow agents/machines.
- an aggregation of several agents'/machines' sequences of actions (like an aggregated syslog server file), where the implicit `min_gap` is strictly 1, but may be relaxed to some higher value to allow for clock offset between agents, ...
That makes me think of a configurable `min_gap` parameter in the `*_gaps()` functions for irregular start-end time series...
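To make the idea concrete, here is a minimal base R sketch of what such a gap scan could look like. It is only an illustration under my own assumptions: `scan_gaps_start_stop()` and its arguments are hypothetical names, not part of {tsibble}, `min_gap` is taken to be in seconds, and the third event in the sample data is made up (the first two are the observations from the reprex above).

```r
# Hypothetical helper, NOT a tsibble function: list the gaps between
# consecutive start-stop events within each key.
# min_gap (assumed to be in seconds) is the smallest distance between the stop
# of event n and the start of event n + 1 that is reported as a gap.
scan_gaps_start_stop <- function(data, key_col, start_col, stop_col, min_gap = 1) {
  pieces <- lapply(split(data, data[[key_col]], drop = TRUE), function(d) {
    d <- d[order(d[[start_col]]), , drop = FALSE]
    n <- nrow(d)
    if (n < 2) return(NULL)            # a single event cannot have a gap
    gap_start <- d[[stop_col]][-n]     # stop time of event n
    gap_end   <- d[[start_col]][-1]    # start time of event n + 1
    gap_secs  <- as.numeric(difftime(gap_end, gap_start, units = "secs"))
    keep <- gap_secs >= min_gap
    out <- data.frame(
      key = d[[key_col]][-n][keep],
      gap_start = gap_start[keep],
      gap_end = gap_end[keep],
      gap_secs = gap_secs[keep]
    )
    names(out)[1] <- key_col
    out
  })
  pieces <- Filter(Negate(is.null), pieces)
  if (length(pieces) == 0) return(data.frame())
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Sample data: rows 1 and 2 come from the reprex above; row 3 is made up and
# starts exactly when row 2 stops (a contiguous event, i.e. a 0-second gap).
events <- data.frame(
  id = "a",
  start = as.POSIXct(c("2018-02-26 19:15:40", "2018-02-27 07:58:13",
                       "2018-02-27 12:03:27"), tz = "UTC"),
  stop = as.POSIXct(c("2018-02-27 07:52:49", "2018-02-27 12:03:27",
                      "2018-02-27 13:00:00"), tz = "UTC")
)

gaps <- scan_gaps_start_stop(events, "id", "start", "stop", min_gap = 1)
print(gaps)
```

With `min_gap = 1` the contiguous pair is not reported, and only the stretch between 07:52:49 and 07:58:13 comes back as a gap. Raising `min_gap` would cover the rounding and clock-offset cases described in the two use cases.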