BOOM
BOOM copied to clipboard
predict method does not use old.data correctly.
Dear Steve,
I write you from CIG group of the Technical University of Madrid. We
are working in a project where we are trying to apply Bayesian
Structural Time Series to generate a model that identifies a time
series structure.
Our objective is not forcasting from the last known value in the
training data, instead we aim to predict using other starting evidence
with previous time stamps (that can be equal to the ones of the
training data).
We have some sample of the same time series in the same timestamp, we
use this data to train the model and then we would evaluate the
results of the prediction over each sample, so that we are using the
predict function with the parameter olddata defined, but we have some
problems:
-
The predict function doesn´t use olddata without regression component We have included lags of the time series as regressor variables to solve that. We are working now with a model that has regressor variables, so we
add the newdata parameter to the predict function. The function gives
us a solution using the last values as first data, but we want to
introduce information of previous data to force a solution focus on
one of the sample of the time series. -
When we combine olddata with new data to get a prediction, the
function shows the next error:
Error: number.of.time.points > 0 is not TRUE In addition: Warning message: In is.na(observed.data$response) : is.na() applied to non-(list or vector) of type 'NULL'
We have debugged the code and it seems like the predict function
removes the response variable on the .FormatObservedDataForPredictions
step. Looking into this function, the function .ExtractResponse
doesn´t get the response variable name from the formula parameter.
We are not sure if this is a code problem or that we don´t know how to
use the olddata parameter properly because we haven´t found any example.
Thank you for your attention, Gabriel and David
First problem
The olddata parameter don´t combine with olddata.tiemstamp parameter, and don´t change enough the results.
Get data
library(tidyverse, quietly = TRUE)
library(bsts, quietly = TRUE)
library(data.table)
data(iclaims)
.data <- initial.claims
claims <- .data$iclaimsNSA
plot(claims, ylab = "")
Fit model
In this step just fit a bsts model that include Local linear trend and seasonal components.
model_components <- list()
summary(model_components <- AddLocalLinearTrend(model_components,
y = claims))
summary(model_components <- AddSeasonal(model_components, y = claims,
nseasons = 52))
fit <- bsts(claims, model_components, niter = 2000)
Predictions
First test
Error obtained when trying to include the olddata and olddata.timestamps parameters. Also, include de validation of that errors it is not real, number of observations and lenght of timestamp are equal.
pred_wolddata_ts <- try(predict(fit,
horizon = 100,
burn = burnin,
quantiles = c(.05, .95),
olddata = claims[50:80],
olddata.timestamps = index(claims[50:80]),
seed = 2408))
print(paste0("Difference between length of data and timestamp: ",length(index(claims[50:80])) - length(claims[50:80])))
pred_wolddata_ts
Error : number.of.observations == length(timestamps) is not TRUE [1] "Difference between length of data and timestamp: 0" [1] "Error : number.of.observations == length(timestamps) is not TRUE\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError: number.of.observations == length(timestamps) is not TRUE>
Second test
Second experiment, in this case removing timestamp, just olddata parameter has been defined using the 50:80 observations from training data.
pred_wolddata <- try(predict(fit,
horizon = 100,
burn = burnin,
quantiles = c(.05, .95),
olddata = claims[50:80],
seed = 2408))`
and a second one with training data from 20 to 30 that is a very diferente step in timeseries, we want to validate the effect of alldata in the prediction
pred_wolddata2 <- predict(fit,
horizon = 100,
burn = burnin,
quantiles = c(.05, .95),
olddata = claims[20:30],
seed = 2408)
plot(pred_wolddata2)
But the results are equals:
print(paste0("Acumulated diference between training[20:30] and training[50:80] data:",sum(pred_wolddata$mean - pred_wolddata2$mean)))
[1] "Acumulated diference between training[20:30] and training[50:80] data:0"
And no very different from don´t use olddata:
[1] "Acumulated diference between without olddata prediction and training[20:30] data: 60.96"
Second problem
After show some problems with the olddata parameter, let`s explain the second issue.
It is not possible combine newdata parameter for regression and add the olddata parameter to define the initial situation and values to generate a prediction.
We tried use newdata as a alternative to olddata adding some lags of the objective variable, that contain information from past. This alternative solve in some way the forecast problem for diferents timestamps but return and error in combination with olddata parameter.
claims_dt <- as.data.table(claims)
claims_dt[, x_1 := shift(x,n = 1,type = "lag")]
claims_dt[, x_2 := shift(x,n = 2,type = "lag")]
claims_dt[, x_3 := shift(x,n = 3,type = "lag")]
claims_dt[, x_4 := shift(x,n = 4,type = "lag")]
claims_dt <- claims_dt[!1:4,]
Fit model
model_components <- list()
summary(model_components <- AddLocalLinearTrend(model_components,
y = claims))
summary(model_components <- AddSeasonal(model_components,
y = claims,
nseasons = 52))
fit_reg <- bsts(data = claims_dt[1:400,],
formula = x ~.,
state.specification = model_components,
niter = 2000)
Predictions
First test: Same problem that before issue
pred_w_newolddata_ts <- try(predict(fit,
horizon = 100,
burn = burnin,
quantiles = c(.05, .95),
olddata = claims_dt[50:80,],
olddata.timestamps = index(claims_dt[50:80,]),
newdata = claims_dt[401:452,!"x",with = F],
timestamps = 401:452,
seed = 2408))
pred_w_newolddata_ts
print(paste0("Difference of newdata length and newdata.timestamp length):",length(401:452) - length(claims_dt[401:452,!"x",with = F][[1]])))
print(paste0("Difference of olddata length and olddata.timestamp length):",length(index(claims[50:80])) - length(claims[50:80])))
Error : number.of.observations == length(timestamps) is not TRUE [1] "Error : number.of.observations == length(timestamps) is not TRUE\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError: number.of.observations == length(timestamps) is not TRUE> [1] "Difference of newdata length and newdata.timestamp length):0" [1] "Difference of olddata length and olddata.timestamp length):0"
So as before remove olddata.timestamp:
pred_w_newolddata <- try(predict(fit_reg,
horizon = 100,
burn = burnin,
quantiles = c(.05, .95),
olddata = claims_dt[20:30,],
newdata = claims_dt[401:452,!"x",with = F],
timestamps = 401:452,
seed = 2408))
pred_w_newolddata
When we combine olddata with new data to get a prediction, the function shows the next error:
Error: number.of.time.points > 0 is not TRUE In addition: Warning message: In is.na(observed.data$response) : is.na() applied to non-(list or vector) of type 'NULL'
We have debugged the code and it seems like the predict function removes the response variable on the .FormatObservedDataForPredictions step. Looking into this function, the function .ExtractResponse doesn´t get the response variable name from the formula parameter. bsts_problems.pdf