trendbreaker
trendbreaker copied to clipboard
The upper and lower bounds are not (always) properly computed for alpha = 0
For alpha = 0
, the upper and lower bounds should always be Inf
and 0
, respectively.
However:
- For
glm
: trendbreaker usesciTools::add_pi
internally to find upper and lower bounds.- It sometimes finds finite bounds. In my opinion this is a bug of ciTools which trendbreaker should correct.
- This leads to alarms being raised (outliers detected) where none should be.
- Also, it is unclear why the bounds are not exactly the same from
asmodee
andciTools::add_pi
whenglm
is used in the former and both are applied in the same.
- For
lm
: a lower bound of-Inf
is sometimes found, which per se is fine for the assumed normal distribution, but doesn't make sense for counts. trendbreaker should either renounce usinglm
(see https://github.com/reconhub/trendbreaker/issues/34) or set a minimum lower bound of 0.
The correct behavior is shown below with MASS::glm.nb
, where trendbreaker sets bounds explicitly based on the negative binomial.
See https://github.com/reconhub/trendbreaker/R/model-interface.R.
# For `lm`: a lower bound of -Inf is sometimes found, which is per se fine for the assumed normal
# distribution, but doesn't make sense for counts. ASMODEE should either renounce using `lm`
# or set a minimum lower bound of 0.
#
# The correct behavior is shown with `MASS::glm.nb`, where ASMODEE set bounds explicitly based on
# the negative binomial.
#
# Also, `asmodee` is not deterministic.
#
# See trendbreaker/R/model-interface.R
library(tidyr)
library(dplyr)
library(trendbreaker)
library(ggplot2)
save_plots_png <- T
models <- list(
poisson_constant = glm_model(count ~ 1, family='poisson'),
regression = lm_model(count ~ date),
negbin_time = glm_nb_model(count ~ date)
)
# seed = 1 => ASMODEE selects `lm` for Poisson noise
# seed = 10 => ASMODEE selects `glm` for Poisson noise and `glm.nb` for Poisson noise with peak
for (seed in c(1, 10)) {
set.seed(seed)
ts_list <- list(
random = tibble(date=1:42, count=rpois(42, 100)),
peak = tibble(date=1:42, count=rpois(42, 100)+c(rep(0,20), 200, rep(0,21)))
)
for (ts_type in names(ts_list)) {
ts <- ts_list[[ts_type]]
asmodee_res <- asmodee(
ts,
models = models,
alpha = 0,
max_k = 12,
method = evaluate_aic
)
print(paste0(
'ASMODEE with alpha = 0 on constant Poisson noise ',
ifelse(ts_type=='peak','+ peak ',''),
'with seed ', seed, ':'
))
print(asmodee_res$model$model$call)
print('upper:')
print(asmodee_res$results$upper)
print('lower:')
print(asmodee_res$results$lower)
asmodee_plot <- plot(asmodee_res,'date') +
ggtitle(paste0('ASMODEE with alpha = 0 on constant Poisson noise ',
ifelse(ts_type=='peak','+ peak ',''),
'with seed ', seed))
print(asmodee_plot)
if (save_plots_png) {
ggsave(plot=asmodee_plot, filename=paste0('asmodee_res_',seed,'_',ts_type,'.png'),
width=20, height=10, units='cm', dpi=150)
}
cat('\n')
if (as.character(asmodee_res$model$model$call)[1]=='glm') {
ci_pi <- ciTools::add_pi(
tb = ts,
fit = glm(formula = count ~ 1, family = "poisson", data = ts),
alpha = 0,
names = c("lower", "upper")
)
print('ciTools::add_pi for Poisson GLM:')
print('glm(formula = count ~ 1, family = \'poisson\', data = ts)')
print('upper:')
print(ci_pi$upper)
print('lower:')
print(ci_pi$lower)
warn <- warnings()
if (length(warn[[length(warn)]])>0) {
if (warn[[length(warn)]][[1]]=='add_pi.glm') {
print('in ciTools::add_pi.glm:')
print(names(warn)[length(warn)])
}
}
cat('\n')
}
}
}
# Applied to a time series from a "flare-up" scenario
ts <- tibble(
date=1:42,
count=c(2, 2, 2, 2, 1, 1, 2, 2, 2, 0, 1, 0, 0, 0, 1, 0, 1, 1, 2, 1, 1, 0, 1, 2, 1, 2, 2, 2, 0, 0,
2, 3, 2, 1, 0, 1, 1, 0, 2, 3, 0, 7)
)
i_finite <- c()
i_outlier <- c()
model_finite <- c()
plot_outlier <- T
n_trials <- 100
for (j in 1:n_trials) {
asmodee_res <- asmodee(
ts,
models = models,
alpha = 0,
max_k = 12,
method = evaluate_aic
)
ubs <- asmodee_res$results$upper
lbs <- asmodee_res$results$lower
if (any(is.finite(ubs)) | any(lbs!=0)) {
i_finite <- c(i_finite, j)
model_finite <- c(model_finite, as.character(asmodee_res$model$model$call)[1])
}
if (any(asmodee_res$results$outlier)) {
i_outlier <- c(i_outlier, j)
if (plot_outlier) {
asmodee_plot <- plot(asmodee_res,'date') +
ggtitle(paste0('ASMODEE with alpha = 0 on "flare-up" time series, trial ',j))
print(asmodee_plot)
if (save_plots_png) {
ggsave(plot=asmodee_plot, filename=paste0('asmodee_res_flareup_',j,'.png'),
width=20, height=10, units='cm', dpi=150)
}
plot_outlier <- F
}
}
}
print('ASMODEE with alpha = 0 on a \'flare-up\' time series:')
print('Proportion of trials with finite bounds:')
print(paste0(round(100*length(i_finite)/n_trials),'%'))
print('Proportion of trials with outliers:')
print(paste0(round(100*length(i_outlier)/n_trials),'%'))
print('Models leading to finite bounds:')
print(unique(model_finite))
Output:
[1] "ASMODEE with alpha = 0 on constant Poisson noise with seed 1:"
lm(formula = count ~ date, data = data)
[1] "upper:"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
[1] "lower:"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
-Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
-Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
39 40 41 42
-Inf -Inf -Inf -Inf
[1] "ASMODEE with alpha = 0 on constant Poisson noise + peak with seed 1:"
MASS::glm.nb(formula = count ~ date, data = data, init.theta = 25.08322303,
link = log)
[1] "upper:"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
[1] "lower:"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
32 33 34 35 36 37 38 39 40 41 42
0 0 0 0 0 0 0 0 0 0 0
[1] "ASMODEE with alpha = 0 on constant Poisson noise with seed 10:"
glm(formula = count ~ 1, family = "poisson", data = data)
[1] "upper:"
[1] 133 135 130 142 134 134 136 136 136 135 134 139 135 138 135 135 131 134 134 147 131 132
[23] 131 136 139 132 143 135 136 136 142 135 130 134 135 132 139 136 135 137 138 137
[1] "lower:"
[1] 66 70 61 69 69 68 68 63 65 68 70 65 63 68 62 68 68 67 65 64 63 68 66 65 71 66 69 59 69 67
[31] 69 58 66 67 65 66 66 67 65 65 72 67
[1] "ciTools::add_pi for Poisson GLM:"
[1] "glm(formula = count ~ 1, family = 'poisson', data = ts)"
[1] "upper:"
[1] 129 129 139 133 134 134 140 134 137 138 131 133 134 131 137 135 136 129 135 141 137 135
[23] 134 131 136 138 142 136 139 134 137 140 135 136 138 139 130 133 138 131 136 131
[1] "lower:"
[1] 66 64 67 65 68 65 71 56 65 66 68 71 65 67 68 67 71 65 62 64 64 67 66 58 69 60 69 66 65 67
[31] 64 61 66 66 62 68 64 68 70 67 65 67
[1] "ASMODEE with alpha = 0 on constant Poisson noise + peak with seed 10:"
MASS::glm.nb(formula = count ~ date, data = data, init.theta = 22.41823699,
link = log)
[1] "upper:"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf Inf
[1] "lower:"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
32 33 34 35 36 37 38 39 40 41 42
0 0 0 0 0 0 0 0 0 0 0
[1] "ASMODEE with alpha = 0 on a 'flare-up' time series:"
[1] "Proportion of trials with finite bounds:"
[1] "100%"
[1] "Proportion of trials with outliers:"
[1] "26%"
[1] "Models leading to finite bounds:"
[1] "glm"
(The last percentages will vary from run to run, see https://github.com/reconhub/trendbreaker/issues/35.)
Why would one choose an alpha = 0? Maybe we could solve it by requiring alpha > 0
?
Why would one choose an alpha = 0? Maybe we could solve it by requiring
alpha > 0
?
alpha=0
is an admissible and well defined value, with a clear expectation as to what this should lead to.
0 is not an interesting value for practical use. However to me that points to two problems:
- a bug in
ciTools::add_pi
so that I don't know whether I can trust the values it computes foralpha>0
- an inconsistency in the way
alpha
is handled between Poisson and NegBin.
Maybe we need to move away from ciTools
and roll out our own for the different model types? We included ciTools
to have a default method for many model types just to have something.
Currently looking at this in the noci branch.