rstanarm
rstanarm copied to clipboard
`posterior_survfit` should return covariate data alongside the predictions
Summary:
For stan_surv
models the id
column in the data frame returned by posterior_survfit()
represents the row of the covariate data that the predictions "belong" to. But the id
column can be confusing, especially when the name id
was used for one of the variables (e.g. the grouping factor) in the model itself.
The more explicit thing might be to return all of the covariate data used in the predictions as part of the data returned by posterior_survfit
; that way the user knows exactly the covariate values that the row of predictions belongs to. Rather than having to infer it based on some "id" (i.e. row identifier) column.
This is essentially what tripped up one user here.
Reproducible Steps:
Here we fit a model with a grouping factor called id
with values 5:10
, then we predict survival probabilities for the groups that had id
values 5
and 6
library(rstanarm)
data <- data.frame(
trt = rep(c("A", "B"), each = 15),
id = rep(5:10, each = 5),
status = sample(0:1, 30, replace = TRUE),
days = sample(10, 30, replace = TRUE))
mod = stan_surv(
formula = Surv(days, status) ~ trt + (1 | id),
data = data,
basehaz = "weibull",
chains = 1,
iter = 20,
cores = 1)
posterior_survfit(
mod,
type = "surv",
newdata = data[1:10,],
times = 1,
extrapolate = FALSE)
which returns:
stan_surv predictions
num. individuals: 10
prediction type: event free probability
standardised?: no
conditional?: no
id cond_time time median ci_lb ci_ub
1 1 NA 1.0000 0.9876 0.9716 0.9950
2 2 NA 1.0000 0.9876 0.9716 0.9950
3 3 NA 1.0000 0.9876 0.9716 0.9950
4 4 NA 1.0000 0.9876 0.9716 0.9950
5 5 NA 1.0000 0.9876 0.9716 0.9950
6 6 NA 1.0000 0.9852 0.9684 0.9936
7 7 NA 1.0000 0.9852 0.9684 0.9936
8 8 NA 1.0000 0.9852 0.9684 0.9936
9 9 NA 1.0000 0.9852 0.9684 0.9936
10 10 NA 1.0000 0.9852 0.9684 0.9936
but we see that the id
column in the predictions has values 1
through 10
. This is because it is referring to rows 1 through 10 of the prediction covariate data, not the values for the id
variable in the original model! It would be clearer / safer if we just returned all the covariates, including id
, and then we wouldn't even need a row identifier.
Yeah I see what you mean. I'm not opposed to the solution that you propose but another option could be to switch to using the column name .id
instead of id
to indicate that it's a metadata column (this is similar to what we're doing in the posterior package with column names like .chain
instead of chain
). Or maybe the name .row_id
would be even less likely to be confused with a variable in the data.