Timeslice with longitudinal data
I am new to caret and have a beginner's question about the 'timeslice' resampling method that 'trainControl' offers for caret's 'train' function.
My original data is a balanced panel with 22 years and 37,442 unique cross-sectional units. Here is a small example data set that illustrates its structure:
dat <- data.frame( id = sort( rep( c( "A", "B", "C" ), 22 )),
                   t = rep( 2000:2021, 3 ),
                   y = round( runif( 66, 10, 200 ), 0 ),
                   x1 = rnorm( 66 ),
                   x2 = rbinom( 66, 3, 0.3 ))
I tried to use 'train' to run a simple random forest model on the data with a fixed time window of 5 years and a horizon of 2 years:
library( caret )
library( ranger )
model <- train(
  y ~ .,
  tuneLength = 5,
  data = dat,
  method = "ranger",
  trControl = trainControl(
    method = "timeslice",
    initialWindow = 5,
    horizon = 2,
    allowParallel = TRUE,
    verboseIter = TRUE,
    seeds = NULL
  ),
  metric = "RMSE"
)
However, this gives the following error:
Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + :
cannot take a sample larger than the population when 'replace = FALSE'
I presume this error occurs because the data is not a single time series but a longitudinal data set. So my question is: how can longitudinal data be handled with 'timeslice'?
You can pass the target variable (y) and the predictors (x) to 'train' separately instead of using the formula interface:
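# Keep dat as a data frame: as.matrix() would coerce every column to
# character and dat$y would no longer work.
dat$id <- factor( dat$id )   # encode id as a factor rather than character
drop.col <- -3               # column 3 is the target y
model <- train(
  x = dat[, drop.col],       # predictors: id, t, x1, x2
  y = dat$y,                 # target
  tuneLength = 5,
  method = "ranger",         # no 'data' argument with the x/y interface
  trControl = trainControl(
    method = "timeslice",
    initialWindow = 5,
    horizon = 2,
    allowParallel = TRUE,
    verboseIter = TRUE,
    seeds = NULL
  ),
  metric = "RMSE"
)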
NOTE: id needs to be encoded as a factor, e.g. with factor().
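Keep in mind that 'timeslice' builds its windows over rows, not over years, so on a panel initialWindow = 5 means five rows rather than five years, and the number of slices grows with the number of rows. One possible workaround, sketched here under the assumption that every slice should contain all units for a block of consecutive years, is to build the slices by hand and pass them through the index and indexOut arguments of trainControl. The helper names (initial_window, horizon, train_idx, test_idx) are made up for illustration, and the toy data mirrors the example in the question.

```r
# Toy panel with the same structure as the example data in the question
set.seed(1)
dat <- data.frame(
  id = factor(rep(c("A", "B", "C"), each = 22)),
  t  = rep(2000:2021, 3),
  y  = round(runif(66, 10, 200), 0),
  x1 = rnorm(66),
  x2 = rbinom(66, 3, 0.3)
)

initial_window <- 5  # assumed: number of YEARS in each training window
horizon        <- 2  # assumed: number of YEARS in each test window

years  <- sort(unique(dat$t))
starts <- seq_len(length(years) - initial_window - horizon + 1)

# Row indices of all units whose year falls inside each training window
train_idx <- lapply(starts, function(s) {
  which(dat$t %in% years[s:(s + initial_window - 1)])
})

# Row indices of all units whose year falls inside the following test window
test_idx <- lapply(starts, function(s) {
  which(dat$t %in% years[(s + initial_window):(s + initial_window + horizon - 1)])
})

names(train_idx) <- names(test_idx) <- paste0("Slice", starts)

# Hand the custom slices to caret (skipped if caret is not installed)
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(
    index       = train_idx,  # training rows per resample
    indexOut    = test_idx,   # hold-out rows per resample
    verboseIter = TRUE
  )
}
```

With 22 years this gives 16 slices, each training on 15 rows (5 years x 3 units) and testing on the 6 rows of the following 2 years. The resulting ctrl object can then be passed as trControl to train() in place of the 'timeslice' settings.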