JMbayes2
JMbayes2 copied to clipboard
Data format question and issue with predict function
Hi,
I had a question regarding the data format required to use JMbayes2, and I also have an issue with the predict function.
Here is an example with a small dataset.
library(JMbayes2) # I am using version 0.5-0
library(dplyr)
train_cox <- structure(list(tstart = c(0, 0, 0.028, 0.306, 0.334, 0.36, 0.388,
0.429, 0.666, 0.684, 0.72, 0.748, 1.387, 1.442, 1.652, 1.706,
1.748, 1.802, 1.963, 2.014, 2.066, 2.108, 2.162, 2.373, 2.426,
0, 0.306, 0.307, 0.36, 0.667, 0.348, 0.486, 0.559, 0.614, 0.698,
0.699, 1.003, 1.058, 1.364, 1.418, 1.727, 1.779, 2.086, 2.139,
2.445, 2.5, 2.543, 2.716, 2.797, 2.86), tstop = c(0.011, 0.028,
0.306, 0.334, 0.36, 0.388, 0.429, 0.666, 0.684, 0.72, 0.748,
0.839, 1.442, 1.652, 1.706, 1.748, 1.802, 1.963, 2.014, 2.066,
2.108, 2.162, 2.373, 2.426, 2.445, 0.306, 0.307, 0.36, 0.667,
0.672, 0.486, 0.559, 0.614, 0.698, 0.699, 1.003, 1.058, 1.364,
1.418, 1.727, 1.779, 2.086, 2.139, 2.445, 2.5, 2.543, 2.716,
2.797, 2.86, 3.035), event = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), var1.surv = c(0L,
0L, 0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 0L, 1L, 0L, 0L, 1L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L,
3L), id = c("1", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",
"3", "3", "4", "4", "4", "4", "4", "5", "5", "5", "5", "5", "5",
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5",
"5")), class = "data.frame", row.names = c(NA, -50L))
train_mixed <- structure(list(tstart = c(0, 0, 0.033, 0.066, 0.098, 0.13, 0.162,
0.195, 0.228, 0.259, 0.292, 0.487, 0.519, 0.552, 0.585, 0.616,
0.649, 0.682, 0.714, 0.746, 0.779, 0.811, 0.844, 0.876, 0, 0.032,
0.065, 0.098, 0.13, 0.162, 0.195, 0.227, 0.26, 0.292, 0.325,
0.358, 0.39, 0.422, 0.455, 0.487, 0.519, 0.551, 0.584, 0.617,
0.649, 0.681, 0.098, 0.13, 0.163, 0.196, 0.228, 0.26, 0.292,
0.325, 0.358, 0.39, 0.422, 0.455, 0.487, 0.519, 0.552, 0.585,
0.617, 0.649, 0.682, 0.715, 0.747, 0.779, 0.812, 0.845, 0.877,
0.909, 0.941, 0.974, 1.007, 1.038, 1.071), var1.long = c(0, 0,
0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1), var2.long = c(0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 1, 0), long1 = c(1L, 4L, 2L, 1L, 1L,
3L, 3L, 1L, 0L, 3L, 2L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 0L, 2L, 4L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 5L, 1L, 2L,
3L, 0L, 0L, 0L, 4L, 2L, 0L, 2L, 2L, 1L, 6L, 1L, 3L, 0L, 1L, 1L,
1L, 3L, 3L, 1L, 2L, 0L, 1L, 5L, 2L, 2L, 1L, 1L, 7L, 3L, 5L, 3L,
3L, 1L, 3L, 6L, 1L, 1L, 3L, 0L), my_offset = c(83L, 81L, 85L,
81L, 79L, 71L, 86L, 70L, 73L, 90L, 90L, 74L, 81L, 84L, 83L, 84L,
86L, 84L, 87L, 70L, 77L, 90L, 79L, 76L, 76L, 80L, 87L, 78L, 89L,
71L, 74L, 80L, 71L, 89L, 72L, 84L, 75L, 78L, 84L, 74L, 83L, 74L,
80L, 80L, 86L, 70L, 82L, 70L, 88L, 73L, 85L, 86L, 76L, 75L, 89L,
85L, 71L, 82L, 72L, 79L, 75L, 76L, 86L, 70L, 70L, 72L, 80L, 82L,
87L, 74L, 85L, 77L, 71L, 73L, 75L, 86L, 78L), id = c("1", "2",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3", "3", "4", "4", "4", "4",
"4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4",
"4", "4", "4", "4", "4", "5", "5", "5", "5", "5", "5", "5", "5",
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5",
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5")), row.names = c(NA,
-77L), class = "data.frame")
The datasets have the following characteristics:
- No recurring events
- Time-dependent covariates and artificial censoring
- More rows in the longitudinal data than in the survival data (explanation below)
- Different covariates in both dataset (explanation below)
Let's look at an hypothetical example for my use case. Consider patients who go to the doctor periodically. Every visit to the doctor generates a line in the survival dataset. Here, tstart represents the first time a person goes to the doctor, and tstop the next time there is a visit (tstart of the next visit). There is an event when the patient has developed a certain disease. At each visit, we take measure of the patient's health (var1.surv in train_cox). We want to use the previous covariates to predict if there will be a disease at tstop.
Some of the patients will visit the doctor once a month, some once every three months, others maybe twice a year when they don't feel well. For our longitudinal submodel, I want to have the same number of observations per year for each patient (I understand it might not be clear why I want that; simply assume it makes sense). Let's say we want 12 lines per patient per year, so one line every month. If patients A visits the doctor twice in January, we would need to summarize the two lines in the survival submodel in one row. If patient B does not visit in January, February and March, we would need to generate 3 rows in the longitudinal submodel while there is none for this period in the survival submodel. In our data, the second scenario happens more often, and that's why we have more rows in the longitudinal data compared to the survival data (77 rows vs 50 rows).
Finally, when we summarize or create new rows for the longitudinal data, the variables are not exactly the same than in the survival data. To see why, consider once again the example of patient A who visited the doctor twice in January. If we measure the patient's pressure every visit, we would have two rows, each with one pressure measurements. To summarise into one line in our longitudinal data, we could use the average patient's pressure and the maximum pressure in January. In this case, there would be two variables in the longitudinal submodel "linked" to one variable in the survival submodel. In the data I provided, this could be the variable var1.surv with the two variables var1.long and var2.long. This is why the variables are not the same. I hope it makes sense.
Here is the code with JMbayes2:
cox_model <-
coxph(
Surv(tstart, tstop, event) ~ var1.surv,
data=train_cox
)
mixed_model <-
mixed_model(
fixed=long1 ~ 1 + offset(log(my_offset)) + tstart + var1.long + var2.long,
random= ~ 1 + tstart | id,
data=train_mixed,
family=poisson()
)
joint_model_1 <-
jm(
cox_model,
mixed_model,
n_iter=500,
n_burnin=10,
time_var = "tstart"
)
# Call:
# jm(Surv_object = cox_model, Mixed_objects = mixed_model, time_var = "tstart",
# n_iter = 500, n_burnin = 10)
#
# Data Descriptives:
# Number of Groups: 5 Number of events: 3 (6%)
# Number of Observations:
# long1: 77
#
# DIC WAIC LPML
# marginal 300.7981 8407.52 -587.4744
# conditional 712.6566 24816.90 -595.8393
#
# Random-effects covariance matrix:
#
# StdDev Corr
# (Intr) 0.2371 (Intr)
# tstart 0.1855 -0.0164
#
# Survival Outcome:
# Mean StDev 2.5% 97.5% P Rhat
# var1.surv -1.2230 1.4821 -4.6594 1.0273 0.4231 1.0187
# value(long1) -0.5355 0.8202 -3.0937 0.2179 0.3497 1.5598
#
# Longitudinal Outcome: long1 (family = poisson, link = log)
# Mean StDev 2.5% 97.5% P Rhat
# (Intercept) -0.3153 1.0827 -3.6489 0.6282 0.6857 1.1407
# tstart 0.7003 0.6809 -0.1660 2.2227 0.1306 1.4146
# var1.long 0.0057 0.1864 -0.3852 0.3724 0.7592 1.0256
# var2.long -0.0857 0.1923 -0.4838 0.2495 0.8680 1.0204
#
# MCMC summary:
# chains: 3
# iterations per chain: 500
# burn-in per chain: 10
# thinning: 1
# time: 1.1 min
The function seems to work, but I would like to make sure that time dependent covariates from the longitudinal submodel are properly handled in the survival submodel when the data has the format described above. Should I merge the two datasets together so the survival submodel has access to the evolution of the time-varying covariates in the longitudinal dataset, or it is not necessary? Is it only necessary to have the longitudinal variable in the survival dataset, which is we add a random variable if it is not present in the source code?
My other question arose when I tried to use the predict function.
# create random column for the longitudinal variable
set.seed(8989)
# need to add a random value of the longitudinal variable to predict
train_cox$long1 <- rpois(n=nrow(train_cox), lambda=mean(train_mixed$long1))
predLong1 <- predict(joint_model_1, newdata=list(newdataL=train_mixed, newdataE=train_cox),
return_newdata=FALSE,
process="longitudinal",
type_pred = "response",
type="subject_specific"
)
# longitudinal prediction works fine
predSurv <- predict(joint_model_1, newdata=list(newdataL=train_mixed, newdataE=train_cox),
return_newdata=FALSE,
process="event",
type_pred = "response",
type="subject_specific"
)
# Error in prepare_Data_preds(object, newdataL2, newdataE2) :
# the number of groups/subjects in the longitudinal and survival datasets do not seem to match. A potential reason why this may be happening is missing data in some covariates used in the individual models.
# In addition: Warning messages:
# 1: In Surv(tstart, tstop, event) :
# Stop time must be > start time, NA created
I do not understand the error and the warning, since the data seems fine :
# the number of groups/subjects in the longitudinal and survival datasets do not seem to match. A potential reason why this may be happening is missing data in some covariates used in the individual models.
train_mixed$id %>% unique # [1] "1" "2" "3" "4" "5"
train_cox$id %>% unique # [1] "1" "2" "3" "4" "5"
# any NAs?
sapply(train_cox, function(y) sum(is.na(y)))
# tstart tstop event var1.surv id
# 0 0 0 0 0
sapply(train_mixed, function(y) sum(is.na(y)))
# tstart var1.long var2.long long1 my_offset id
# 0 0 0 0 0 0
all(train_cox$tstart < train_cox$tstop) # TRUE
Thank you for your time!