Multiple predictors
Each segment should take an arbitrary number of linear predictors. As with the segmented package, the only requirement is that one continuous predictor (say, x) is the dimension of the change point. The change point is simply the value on x where the prediction of y changes to a different regression model (parameter structure and/or values).
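To make the change-point idea concrete, here is a toy R sketch (made-up values, not mcp code) of a mean function that switches from an intercept-only segment to a slope segment at x = 50:

```r
# Toy illustration (not mcp code): an intercept-only segment that
# switches to a linear segment at the change point cp_1 = 50.
x = seq(0, 100, length.out = 200)
cp_1 = 50
y_hat = (x <= cp_1) * 10 +                  # segment 1: flat at y = 10
  (x > cp_1) * (10 + 0.5 * (x - cp_1))      # segment 2: slope 0.5 after cp_1
plot(x, y_hat, type = "l")
```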
So this API should work. It has the following features:
- [x] Support categorical predictors via dummy coding.
- [x] Support interactions
- [x] Each segment can have different numbers of predictors.
```r
model = list(
  y ~ 1 + x*group + z + sigma(1 + group),  # interactions, main effects, and a covariate
  ~ 0 + x + ar(2, z),                      # only one slope
  ~ 1 + group                              # a range of x where group is the only predictor
)
```
JAGS-wise the indicator functions would be the same but now we additionally pass design matrices (X1_, X2_, etc.) and use inprod() per segment. The model above would be something like:
```
# Priors for individual parameters
cp_1 ~ ... T(MINX, MAXX)
cp_2 ~ ... T(cp_1, MAXX)
int_1 ~ dnorm(0, 1^-2)
int_3 ~ ...
xGroupFemale_1 ~ dnorm(0, 1)  # check R naming convention
z_1 ~ dunif(0, 100)
x_2 ~ dnorm(4, 3^-2)
xGroupFemale_3 ~ ...

# Model and likelihood
for(i_ in 1:length(x1_)) {
  y_[i_] = (x_[i_] > cp_0) * (int_1 + inprod(c(xGroupFemale_1, z_1), x1_[i_, ])) +
           (x_[i_] > cp_1) * (0 + inprod(c(x_2), x2_[i_, ])) +
           (x_[i_] > cp_2) * (int_3 + inprod(c(xGroupFemale_3), x3_[i_, ]))
  response[i_] ~ dnorm(y_[i_], sigma_[i_])
}
```
where `x1_`, `x2_`, etc. are model matrices built R-side and `x_` is `par_x`, along which the change points are defined. Implementing this adds the following work points:
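As a rough sketch of the R-side step (the `X1_`/`x1_` naming just follows the convention above; the actual implementation may differ), `stats::model.matrix()` handles the dummy coding and interaction expansion:

```r
# Sketch: build the design matrix for segment 1 (y ~ 1 + x*group + z).
# model.matrix() dummy-codes factors and expands interactions automatically.
df = data.frame(
  x = runif(10, 0, 100),
  z = rnorm(10),
  group = factor(sample(c("female", "male"), 10, replace = TRUE))
)
X1_ = model.matrix(~ 1 + x * group + z, data = df)
# Drop the intercept column, since int_1 is a separate parameter in the JAGS code:
X1_ = X1_[, -1, drop = FALSE]
colnames(X1_)  # e.g. "x", "groupmale", "z", "x:groupmale"
```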
Data structure
- [x] Rename intercept: `Intercept_i` instead of `int_i`.
- [x] Require `par_x` if there is not exactly one continuous predictor.
- [x] Decide parameter naming: stay close to `lm` and `brms` but add `_segmentnumber`.
- [x] Split out RHS parameters from the segment table to a new long-format structure.
- [x] Set appropriate priors. Distinguish between intercepts, dummies, and slopes?
- [x] Raise error on rank deficiency using `base::qr()`.
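A minimal version of that rank check (a sketch, not the package's actual code) could look like:

```r
# Sketch: flag rank deficiency in a design matrix via base::qr().
df = data.frame(x = 1:10)
X = model.matrix(~ 1 + x, data = df)
X = cbind(X, x2 = 2 * df$x)  # perfectly collinear column -> rank deficient
if (qr(X)$rank < ncol(X)) {
  stop("Design matrix is rank deficient; remove collinear predictors.")
}
```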
Modeling, sampling, and summaries
- [x] Update `get_formula()` to match the new segment table.
- [x] Update `get_jagscode()`.
- [x] Update `run_jags()` and `get_jags_data()` to work with design matrices. One per (dpar, segment) combo.
- [x] Make sure that it applies for `sigma(1 + x * group)` etc. (I think it will out of the box).
- [x] Verify that it works for `plot_pars()`, `hypothesis()`, `summary()`, `fixef()`, etc.
- [ ] `get_summary()`: Translate between code parameter names and user-facing parameter names.
Simulated, fitted, and predicted values
- [x] Change args to `fit$simulate(fit, data, par1, par2, ...)`, i.e., add `fit` and replace `par_x` with a `data.frame`/`tibble`.
- [x] Check that the columns in `data` have the correct format. Set factor levels to match the original data.
- [x] Make `fit$simulate()` a wrapper around a lower-level fast function to use internally. Call it `fit$.internal$simulate_vec(par_x, cp_1, ..., rhs_par1, rhs_par2, ...)`. Only the former should do asserts, call `add_simulated()`, etc.
- [x] Implement for `ar()`.
- [x] Implement for varying change points.
- [x] Call `simulate_vectorized()` from all internal functions instead of `fit$simulate()`.
- [x] Ensure that it works for `fitted()`, `predict()`, `pp_check()`, etc.
- [x] Fix existing `mcp_examples`.
- [x] Add new `mcp_examples`?
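For the factor-level check above, one possible sketch (the names `data` and `fit$data` are assumptions for illustration) is to re-level each factor column against the original data:

```r
# Sketch: coerce factor columns in new data to the levels of the original data,
# so that model.matrix() produces the same dummy columns in the same order.
for (col in names(data)) {
  if (is.factor(fit$data[[col]])) {
    data[[col]] = factor(data[[col]], levels = levels(fit$data[[col]]))
    if (anyNA(data[[col]])) stop("Unknown factor level in column '", col, "'.")
  }
}
```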
Plot
- [ ] Allow faceting by any unique combination of (categorical) variables using `plot(fit, facet_by = c("my_rhs", "my_varying_cp"))`. Still default to no facets.
- [ ] Control spaghetti groupings and colors using `color_by = c("my_categorical1", "my_categorical2")`. It defaults to `color_by = "all_categorical"`, i.e., all unique combinations of categorical levels on the RHS. This will also set the grouping for spaghettis. I think that `color_by` should pertain solely to the RHS, which shares change points. Varying change points will not be accepted.
- [ ] Add a way to include a subset of the predictors: `plot(fit, effects = "my_categorical1")`. It's like `brms::marginal_effects()`. This should probably be implemented in `tidy_samples()`.
- [ ] Display only a subset of the levels using `plot(fit, filter = data.frame(my_categorical1 = c("levelA", "levelB"), my_categorical2 = "level1"))`. This is like `brms::marginal_effects()`, only `filter` using a `data.frame` replaces `int_conditions`, which is a named list. For variables in `effects` that are not in `filter`, all levels will be included. This should probably be implemented in `tidy_samples()`.
- [ ] (Add option to `facet_by` group levels in `pp_check`?)
- [ ] Add pages to `plot_pars()`.
Tests
- [x] Pass existing test suite.
- [ ] Write tests for combinations of main effects, factor-factor interactions, and factor-continuous interactions on ct, sigma, and ar. Also test the number of expected model parameters with and without intercepts.
- [ ] Expect errors on multiple terms inside base functions, e.g., `~ exp(1 + x)`.
- [ ] Test the new plot functions and `tidy_draws()`.
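One of the parameter-count tests could be sketched like this, using `testthat` (the expectation here only checks the design-matrix column count, which is what the number of RHS parameters should match):

```r
# Sketch: the number of RHS parameters per segment should equal the number
# of design matrix columns produced by the segment's formula.
testthat::test_that("~ 1 + x * group yields the expected parameter count", {
  df = data.frame(x = 1:20, group = factor(rep(c("a", "b"), 10)))
  n_cols = ncol(model.matrix(~ 1 + x * group, data = df))
  testthat::expect_equal(n_cols, 4)  # Intercept, x, groupb, x:groupb
})
```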
@lindeloev Out of curiosity, do you have a timeline on when this will be implemented?
@sjmgarnier I got this to work locally, but while it provides many more modeling options (and is easier to maintain/extend), sampling of currently supported models takes double the time. mcp is already something like 100x slower than other packages; it tries to be useful for modeling options rather than speed. So I'm a bit in two minds about whether to continue down this path. What do you think?
@lindeloev I'll answer very selfishly by saying that I need it for a project that I'm working on :-) I could not find a satisfying alternative to perform weighted segmented regressions with random effects. And I'm a patient man with a powerful computer, so I can wait for a few minutes that the fit finishes. But I'm also very well aware of how time-consuming it is to develop and maintain packages, so I would completely understand if you decided to focus on other priorities.
Cool, mcp is henceforth aimed at patient users with powerful computers :-) Yes, for data with < 200 data points, we're talking a slowdown from ~10 secs to ~20 secs so it's not the end of the world. I think this would be awesome so it's my top priority for the next release (mcp 0.4). Though I think that it will likely be at least a few months before it is out.
The first version with this will likely not have random effects on the RHS. But using a categorical intercept (like group in the OP) should be reasonably close in many scenarios since shrinkage is often minor.
I'm working on this now and have almost finished making all design decisions. Luckily, I found an implementation that won't negatively affect performance. Expect release in a few months, depending on how hard it is to make sensible plot() etc.
Is this implemented in a development version? A reviewer demands a multivariate analysis :/
@mattmoo I just pushed the development version to branch v0.4 here. I think you can install it using `remotes::install_github("lindeloev/mcp@v0.4")`. You can fit it and see parameter summaries, including `plot_pars(fit)`. It works for e.g. `y ~ x + sigma(1 + x:group)` too.
But `fit$simulate()`, `plot()`, `fitted()`/`predict()`, and `pp_check()` won't work yet, though I'm making good progress! So you may want to shift back and forth between versions if you need some of the old functionality for one-predictor-only models.
I've only tested it on Gaussian models so far, so please triple-check. And any feedback and ideas would be much appreciated, BTW!
There is good progress on this. I just pushed the latest version to branch v0.4. Install using `remotes::install_github("lindeloev/mcp@v0.4")`. The easiest way to see it in action is running `result = mcp_example("multiple")`.
I'm basically just missing plot() support for varying effects and categorical predictors.
That's great! I'm waiting on compute resources to actually run the analysis (50,000 datapoints :/)
I just tested the performance locally on a Ryzen 5 3600. For 55,000 data points and a 15-parameter model with categorical predictors, I get around 1 sample per second. So if you run in parallel with default settings (3000 warmup iters + 1500 sampling iters), you might complete sampling in a matter of hours?
```r
ex = mcp_example("multiple")
df_tmp = tidyr::expand_grid(rep = 1:250, ex$data) %>%  # upscale to 55,000 data points
  dplyr::select(-rep)
fit = mcp(ex$model, df_tmp, par_x = "x", chains = 1, adapt = 100, iter = 100)
```
I just pushed a new commit to v0.4 which requires ~10% of the memory of the previous version during sampling. Maybe that helped too.
I'll give it a go, from my previous experience with these data (see link), the sampling does not converge so quickly (and I do not have such a nice processor!).
Hi, is this issue still up to date or is there additional information for how to incorporate additional predictors?
@adrose Unfortunately not. The v0.4 branch currently doesn't run out-of-the-box due to some backwards-incompatible changes in dependency packages that I only incorporated into the v0.3 series. I'm presently prioritizing another project higher, but I'm really looking forward to getting v0.4 out since it's awesome!
Love the package. Looking forward to v0.4.