index support for {lubridate} `start - end` interval

Open cregouby opened this issue 3 years ago • 4 comments

Brief description of the problem

The *_gaps() functions are a very useful set of commands, and I would love to be able to use them on irregular longitudinal data with start and stop index columns, like tsibbledata::nyc_bikes.

Currently, irregular-interval tsibbles are not supported by the *_gaps() function family.

What output is expected

This is a manually edited output:

nyc_bikes_dual_index <- build_tsibble(tsibbledata::nyc_bikes, key = bike_id, index=start_time, index2=stop_time)
scan_gaps(nyc_bikes_dual_index) %>% head(3)

# A tsibble: 4,258 x 12 [0.0149047991726547µs] <America/New_York>
# Key:       bike_id [10]
   bike_id start_time          stop_time           
   <fct>   <dttm>              <dttm>              
 1 26301    2018-02-26 19:15:40 2018-02-27 07:52:49
 2 26301    2018-02-27 07:58:13 2018-02-27 12:03:27
 3 26301    2018-02-27 12:04:54 2018-02-27 13:53:51

cregouby avatar Dec 02 '20 14:12 cregouby

*_gaps() don't know how to handle irregular temporal data. {tsibble} doesn't support dual index. index2 means temporary grouping. Can you please elaborate on what outcome is expected for an irregular tsibble?

earowang avatar Jan 07 '21 23:01 earowang

Hello @earowang. Sure, *_gaps() don't know how to handle irregular temporal data, and that is the reason for my feature request here. Sorry for my mistake in using index2 as a secondary index in the example.

Irregular time series with start-stop times or durations are very common in process control, industrial robot monitoring, and similar fields, so detecting gaps can be crucial on those datasets. As tsibble is a fantastic framework, I would love to have it extended to irregular time series with start-stop, since this generalises the tsibble interval in the calculation of gaps.

The expected outcome is, as in the provided example with nyc_bikes, all start-stop intervals where data is missing. This allows computing usage ratios, evaluating efficiency, detecting communication loss, and so on. nyc_bikes is a toy example here, as usual start-stop data have contiguous time intervals.
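To make the expected outcome concrete, here is a minimal base R sketch. The data frame `events` and the helper `gaps_one_key()` are hypothetical (they are neither taken from nyc_bikes nor part of tsibble), and the sketch assumes that intervals within one key do not overlap.

```r
# Hypothetical start-stop records for two machines (made-up data).
events <- data.frame(
  id    = c("a", "a", "a", "b", "b"),
  start = as.POSIXct(c("2021-01-01 08:00:00", "2021-01-01 08:10:00",
                       "2021-01-01 09:00:00", "2021-01-01 08:00:00",
                       "2021-01-01 08:30:00"), tz = "UTC"),
  stop  = as.POSIXct(c("2021-01-01 08:10:00", "2021-01-01 08:45:00",
                       "2021-01-01 09:20:00", "2021-01-01 08:20:00",
                       "2021-01-01 08:50:00"), tz = "UTC")
)

# Hypothetical helper: for one key, a gap runs from each stop to the next
# start, and is kept only when that span is strictly positive.
# Assumes intervals within a key do not overlap.
gaps_one_key <- function(df) {
  df <- df[order(df$start), ]
  n <- nrow(df)
  if (n < 2) return(NULL)
  gap_start <- df$stop[-n]
  gap_stop  <- df$start[-1]
  keep <- gap_stop > gap_start
  data.frame(id = df$id[-n][keep],
             gap_start = gap_start[keep],
             gap_stop  = gap_stop[keep])
}

# Apply the helper to each key and stack the results.
gaps <- do.call(rbind, lapply(split(events, events$id), gaps_one_key))
rownames(gaps) <- NULL
gaps
```

With this toy data, contiguous records (a stop equal to the next start) produce no gap, so only one gap per machine is reported.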

cregouby avatar Jan 09 '21 15:01 cregouby

The concept of tsibble's interval is different from {lubridate}'s start-end Interval. tsibble's interval is the time difference between time indices, whereas lubridate's Interval defines specific start-end timestamps. A regular time series means a constant difference is assumed across all time indices, and *_gaps() therefore rely on that assumption.
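As an illustration of the regular case, here is a minimal base R sketch. It uses a made-up hourly index and assumes the smallest observed step is the interval, which is a simplification and not tsibble's actual implementation. With a constant step, the implicit gaps are simply the grid points that have no observation.

```r
# Hypothetical hourly index with one observation (03:00) missing.
idx <- as.POSIXct("2021-01-01 00:00:00", tz = "UTC") + 3600 * c(0, 1, 2, 4, 5)

# Differences between consecutive indices, in seconds.
steps <- as.numeric(diff(idx), units = "secs")

# Assumption: the smallest step is the regular interval.
step <- min(steps)

# The full regular grid implied by that interval.
grid <- seq(min(idx), max(idx), by = step)

# Implicit gaps: grid points with no matching observation.
missing_idx <- grid[!grid %in% idx]
missing_idx
```

For start-end data there is no such grid to compare against, which is why the question of what "time difference" means has to be settled first.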

Using nyc_bikes as an example, the time index can be represented with lubridate's Interval class. I'm not sure what the time difference is here for x. Should it be 1 second, obtained by aligning the start or the end of these two observations? Or should it be 1 second, taken from the difference between start and end? In the field you work in, what's the common practice?

library(lubridate)
obs1 <- interval(ymd_hms("2018-02-26 19:15:40"), ymd_hms("2018-02-27 07:52:49"))
obs2 <- interval(ymd_hms("2018-02-27 07:58:13"), ymd_hms("2018-02-27 12:03:27"))
x <- c(obs1, obs2)
x
#> [1] 2018-02-26 19:15:40 UTC--2018-02-27 07:52:49 UTC
#> [2] 2018-02-27 07:58:13 UTC--2018-02-27 12:03:27 UTC

Created on 2021-09-29 by the reprex package (v2.0.1)
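To make the candidate "time differences" concrete for the two observations above, here is a minimal sketch in base R (used only to keep the block dependency-free; lubridate's int_start() and int_end() return the same start and end times). The variable names are illustrative.

```r
# Start and end timestamps of obs1 and obs2 from the example above.
start <- as.POSIXct(c("2018-02-26 19:15:40", "2018-02-27 07:58:13"), tz = "UTC")
end   <- as.POSIXct(c("2018-02-27 07:52:49", "2018-02-27 12:03:27"), tz = "UTC")

# Candidate 1: difference between the two starts (aligning on start).
diff_starts <- as.numeric(difftime(start[2], start[1], units = "secs"))

# Candidate 2: difference between the two ends (aligning on end).
diff_ends <- as.numeric(difftime(end[2], end[1], units = "secs"))

# Candidate 3: length of each observation (end minus start).
durations <- as.numeric(difftime(end, start, units = "secs"))

# Candidate 4: time from the end of obs1 to the start of obs2.
between <- as.numeric(difftime(start[2], end[1], units = "secs"))

c(diff_starts = diff_starts, diff_ends = diff_ends, between = between)
durations
```

The four quantities all differ for these two observations, so no single constant difference can be read off without choosing a convention.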

earowang avatar Sep 28 '21 21:09 earowang

I fully agree that the use of the term interval is misleading here, so let's call it min_gap: the minimum time between the end time of event n and the start time of event n+1 for it to be considered a gap.

In the field I work in, the alignment between an end time and the next start time depends on each situation, and is usually linked to the precision of the source timestamps. (This is not a very useful statement, I know.) We can only assume that a gap cannot be raised for anything shorter than a min_gap of 1 (in units of the last digit of the time precision, i.e. no less than one second for ymd_hms() timestamps). The two main use cases are:

  • a single agent/machine sequence of actions (like a single log file), where there is an implicit min_gap of 1 second. We could think of a parameter to force min_gap to be higher when the times are known to be the result of a rounding operation, like "closest 10-second step" on slow agents/machines.
  • an aggregation of several agents'/machines' sequences of actions (like a file aggregated by a syslog server), where the implicit min_gap is strictly 1, but may be relaxed to some higher value to allow for some clock offset between agents.

That makes me think of a configurable min_gap parameter in the *_gaps functions for irregular start-end timeseries...
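A rough sketch of what such a parameter could look like, in base R. The function `find_gaps()` and its `min_gap` argument are hypothetical and not part of tsibble; the event times are made up, and events are assumed not to overlap.

```r
# Hypothetical helper, not part of tsibble: report gaps of at least `min_gap`
# seconds between the end of event n and the start of event n + 1.
find_gaps <- function(start, stop, min_gap = 1) {
  ord <- order(start)
  start <- start[ord]
  stop <- stop[ord]
  n <- length(start)
  gap_start <- stop[-n]   # end of event n
  gap_stop <- start[-1]   # start of event n + 1
  width <- as.numeric(difftime(gap_stop, gap_start, units = "secs"))
  keep <- width >= min_gap
  data.frame(gap_start = gap_start[keep],
             gap_stop = gap_stop[keep],
             width_secs = width[keep])
}

# Made-up log of four events: contiguous, then a 5 s pause, then a 60 s pause.
ev_start <- as.POSIXct(c("2021-01-01 10:00:00", "2021-01-01 10:00:30",
                         "2021-01-01 10:01:05", "2021-01-01 10:03:00"), tz = "UTC")
ev_stop  <- as.POSIXct(c("2021-01-01 10:00:30", "2021-01-01 10:01:00",
                         "2021-01-01 10:02:00", "2021-01-01 10:04:00"), tz = "UTC")

g1  <- find_gaps(ev_start, ev_stop, min_gap = 1)   # implicit 1-second precision
g10 <- find_gaps(ev_start, ev_stop, min_gap = 10)  # timestamps rounded to 10 s
```

With min_gap = 1 both pauses are reported; raising it to 10 drops the 5-second pause, which is the behaviour one would want for timestamps rounded to the closest 10-second step.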

cregouby avatar Nov 19 '21 10:11 cregouby