fable
Add support for model based outlier detection
The signature that I'm imagining for this function is: outliers(model, data, level, ...)
It returns a tibble containing the rows from data that model classifies as outliers at a given level of confidence.
A default method is also defined which uses quantiles of residuals.
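A minimal base R sketch of what a quantile-of-residuals fallback could look like. The helper name `flag_by_quantile()` and its arguments are hypothetical and are not part of fable or fabletools; the residuals are simulated purely for illustration.

```r
# Hypothetical helper, not an existing fable/fabletools function.
# Flags residuals that fall outside the central `level` quantile band.
flag_by_quantile <- function(resid, level = 0.95) {
  alpha <- (1 - level) / 2
  bounds <- quantile(resid, probs = c(alpha, 1 - alpha),
                     na.rm = TRUE, names = FALSE)
  # Missing residuals are never flagged
  !is.na(resid) & (resid < bounds[1] | resid > bounds[2])
}

set.seed(1)
resid <- c(rnorm(50), 8)  # 50 ordinary residuals plus one large one
flagged <- flag_by_quantile(resid, level = 0.95)
which(flagged)            # positions of the flagged residuals
```

One property worth keeping in mind: a pure quantile rule flags roughly a fixed share (1 - level) of the observations by construction, whether or not the series contains anything unusual.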
Another possibility for the output of a function like identify_outliers() would be a tsibble with a new column called .outlier (or a vector elsewhere in the object), which could be as simple as a logical, but could also hold a statistic describing the degree to which each observation is an outlier. This would allow for easy plotting of outliers with autoplot() and the like.
This would also allow for a function like smooth_outliers(), which could be anything from an implementation of forecast::tsclean() up through implementing all the methods from imputeTS::na_interpolation(). If this created, for example, a new column called .imputed, then you could autoplot() the series, showing the original and altered series, for easy comparison. You could also easily compare methods of identifying outliers, etc.
```r
tsbl %>%
  identify_outliers(ARIMA(value), .98) %>%
  smooth_outliers("StructTS") %>%
  model(THETA(value))
```
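To make the .imputed idea concrete, here is a small base R sketch that replaces flagged values by linear interpolation with `approx()`. The helper `smooth_flagged()` and the toy data are hypothetical; a real implementation would presumably work on a tsibble and use a model-based replacement rather than a straight line.

```r
# Hypothetical helper: replace flagged values by linear interpolation
# between the nearest unflagged neighbours, keeping the original column.
smooth_flagged <- function(df, value = "value", flag = ".outlier") {
  x <- df[[value]]
  x[df[[flag]]] <- NA          # treat flagged points as missing
  idx <- seq_along(x)
  ok <- !is.na(x)              # pre-existing NAs are interpolated as well
  # rule = 2 carries the nearest observed value to the ends of the series
  df$.imputed <- approx(idx[ok], x[ok], xout = idx, rule = 2)$y
  df
}

df <- data.frame(
  index = 1:6,
  value = c(10, 11, 50, 13, 14, 15),
  .outlier = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
smoothed <- smooth_flagged(df)
smoothed[, c("index", "value", ".imputed")]
```

Because the original value column is left untouched, the original and altered series can be plotted side by side.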
I second @davidtedfordholt: having all the data "in-line" is very nice to work with, even if it is (maybe?) less memory efficient. The format OP proposes can then be easily derived using a filter.
We could of course also just extend the tsibble using a left join or something, but that feels a bit more "annoying". I like to pipe data through steps and "enrich" it on the way, and returning a subset would make this difficult or impossible.
It's also simple enough to have smooth_outliers(), when called on a tsibble that has no .outlier variable, call a function that identifies outliers and imputes replacement values in one step, in order to address use cases where efficiency is paramount.
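A rough base R illustration of the "enrich, then filter" point: the augmented table keeps every row, and the subset format is one filter away. The helper `augment_outliers()`, the `.resid` column, and the fixed threshold are all assumptions made for this sketch.

```r
# Hypothetical helper: add an `.outlier` column instead of dropping rows.
# Assumes `df` already carries model residuals in a `.resid` column.
augment_outliers <- function(df, threshold) {
  df$.outlier <- !is.na(df$.resid) & abs(df$.resid) > threshold
  df
}

df <- data.frame(
  index = 1:6,
  value = c(10, 11, 18, 13, 14, 15),
  .resid = c(0.1, -0.2, 6.0, 0.3, -0.1, 0.2)
)

augmented <- augment_outliers(df, threshold = 3)  # same rows, one new column
only_outliers <- augmented[augmented$.outlier, ]  # the subset format, by filter
```

Going the other way, from the subset back to the full series, needs a join against the original data, which is the extra step described above.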
I still like the idea of outliers() being a function which returns a tibble of the outlying observations' row numbers (or perhaps better, a tsibble of the outliers themselves).
Another higher level function like smooth_outliers() (or possibly replace_outliers()?) can then build upon outliers(), model(), and interpolate().
Much like how outliers will be determined with a model-based approach, the way in which they're replaced should also be done via a model specification. I would prefer StructTS(y~...) rather than "StructTS", where StructTS(y~...) is a model specification much like ARIMA(y~...).
It seems beyond the scope of this to consider outlier time series within a larger population of time series. Are we interested in handling both point and subsequence outliers?
Trying to get the idea solidly in my head: outliers() would follow model(), take in a threshold specification, then return a subset of the input tsbl containing the rows with abs(.resid) > threshold?
I think a part of my struggle with the output being either the row numbers or the outliers by themselves, rather than an augmented tsbl, is that we would then need to feed both the output of outliers() and the original tsbl into something to replace them. I realize that I'm thinking more in terms of EDA than production.
Here's what I'm thinking. Once we've looked at the data and determined that we need to examine outliers, we plot them. If outliers() outputs a subset tsbl, we have to feed the original object back in.
```r
tsbl %>%
  model(ARIMA(value ~ trend())) %>%
  outliers(~ std_dev(5.4)) %>% # we know the response from the model
  autoplot(tsbl)
```
If we want to see the band represented by the threshold, we end up needing to feed autoplot() the mbl, the output of outliers() and the specification details of the call to outliers().
```r
mbl <- model(tsbl, ARIMA(value ~ trend()))
tsbl_outlyr <- outliers(mbl, ~ std_dev(5.4)) # I saved 1 character and made it look xtrēm
autoplot(tsbl_outlyr, mbl, ~ std_dev(5.4))
```
If we want to look at a couple of different methods or different thresholds, we're saving objects left and right, and autoplot() is lost to us.
If, on the other hand, we output an augmented version of the original tsbl, we can autoplot(). You can create an additional key for the .detect_method if there are multiple methods, which allows you to facet them to compare. You can also look at what the series looks like after you run interpolate() (which can be made to treat any value marked as an outlier as NA, or you can just write a wrapper called replace_outliers() or whatever, which does so). You could interpolate using multiple models, based on multiple detection methods, and they would all be available for comparison.
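A sketch of the .detect_method idea in base R: each detection rule is applied to the same residuals and the results are stacked into one long table that could be faceted by method. The two rules, their names, and the toy residuals are made up for illustration and do not correspond to any existing fabletools API.

```r
# Hypothetical input: `.resid` holds residuals from some fitted model
df <- data.frame(index = 1:6, .resid = c(0.1, -0.2, 0.3, 6, -0.1, 1.2))

# Two illustrative detection rules (names invented for this sketch)
methods <- list(
  sd_3  = function(r) abs(r) > 3 * sd(r),
  mad_3 = function(r) abs(r) > 3 * mad(r)
)

# One copy of the data per method, keyed by `.detect_method`
stacked <- do.call(rbind, lapply(names(methods), function(m) {
  cbind(df, .detect_method = m, .outlier = methods[[m]](df$.resid))
}))

# Cross-tabulate flagged rows by method
table(stacked$.detect_method, stacked$.outlier)
```

With these toy numbers the two rules disagree: the large residual inflates the standard deviation enough to hide itself from the sd rule, while the MAD rule still flags it. That is the kind of difference a faceted plot would make visible.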
I can't come up with a place where it seems more useful to have a subset of the original tsbl than it would be to have an augmented one. I feel like I'm missing something. That said, I think I might also be trying to protect the excellent name outliers() from being used for a function that I believe has limited use (subsetting), rather than it being the centerpiece. That could mean it identifies outliers for piped use (outputting an augmented tsbl), or as a function that can take both a threshold and a replacement formula (calling interpolate()), returning a tsbl of the same dimensions as the original, but with new values for outliers.
FYI the outliers() generic was added to {fabletools} in https://github.com/tidyverts/fabletools/commit/e6631da45c46e0c63c01461b8201417c8c14fb98
There is an outliers() method for the feasts::X_13ARIMA_SEATS() model, and I hope other models will be supported. There will likely be a residual-based outlier detection fallback method, similar to what has been described in this thread.
I think following the recipes structure may be beneficial. I have created a package, tidy.outliers, that implements outlier detection as a step.