
Multiple predictors

Open lindeloev opened this issue 5 years ago • 15 comments

Each segment should take an arbitrary number of linear predictors. As with the segmented package, the only requirement is that one continuous predictor (say, x) serves as the dimension of the change point. The change point is simply the value on x where the prediction of y switches to a different regression model (different parameter structure and/or values).

So this API should work. It has the following features:

  • [x] Support categorical predictors via dummy coding.
  • [x] Support interactions
  • [x] Each segment can have different numbers of predictors.
model = list(
    y ~ 1 + x*group + z + sigma(1 + group),  # main effects, an interaction, and a covariate.
    ~ 0 + x + ar(2, z),  # only one slope
    ~ 1 + group  # a range of x where group is the only predictor
)
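For intuition, here is how each segment formula could expand into a design matrix on the R side via dummy coding. The data and column names below are illustrative, not mcp internals:

```r
df <- data.frame(
  x = 1:4,
  z = c(0.5, 1.2, 0.8, 2.0),
  group = factor(c("Female", "Male", "Female", "Male"))
)

# Segment 1: main effects, an interaction, and a covariate.
# Dummy coding turns `group` into a groupMale column (Female is the reference level).
X1 <- model.matrix(~ 1 + x * group + z, data = df)
colnames(X1)  # "(Intercept)" "x" "groupMale" "z" "x:groupMale"

# Segment 2: a single slope, no intercept.
X2 <- model.matrix(~ 0 + x, data = df)

# Segment 3: intercept plus dummy-coded group only.
X3 <- model.matrix(~ 1 + group, data = df)
```

Segments with different numbers of predictors simply yield design matrices with different numbers of columns, which per-segment inprod() calls can then consume.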

JAGS-wise the indicator functions would be the same but now we additionally pass design matrices (X1_, X2_, etc.) and use inprod() per segment. The model above would be something like:

    # Priors for individual parameters
    cp_1 ~ ... T(MINX, MAXX)
    cp_2 ~ ... T(cp_1, MAXX)
    int_1 ~ dnorm(0, 1^-2)
    int_3 ~ ...
    xGroupFemale_1 ~ dnorm(0, 1)  # check R naming convention
    z_1 ~ dunif(0, 100)
    x_2 ~ dnorm(4, 3^-2)
    xGroupFemale_3 ~ ...

    # Model and likelihood
    # cp_0 is fixed at MINX, so the first segment always applies.
    for(i_ in 1:length(x_)) {
        y_[i_] = (x_[i_] > cp_0) * (int_1 + inprod(c(xGroupFemale_1, z_1), x1_[i_, ])) + 
                (x_[i_] > cp_1) * (0 + inprod(c(x_2), x2_[i_, ])) +
                (x_[i_] > cp_2) * (int_3 + inprod(c(xGroupFemale_3), x3_[i_, ]))

        response[i_] ~ dnorm(y_[i_], 1 / sigma_[i_]^2)  # JAGS dnorm takes a precision
    }

where xi_ (x1_, x2_, ...) is a design matrix built R-side and x_ is par_x, the dimension along which change points are defined. Implementing this adds the following work points:
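A sketch of what get_jags_data() could hand to JAGS for this model, with one design matrix per segment. Names mirror the pseudocode above; this is an illustration, not mcp's actual internals:

```r
build_jags_data <- function(df) {
  # Full design matrix per segment formula.
  X1 <- model.matrix(~ 1 + x * group + z, data = df)
  X2 <- model.matrix(~ 0 + x, data = df)
  X3 <- model.matrix(~ 1 + group, data = df)
  list(
    response = df$y,
    x_   = df$x,                    # par_x: the change point dimension
    # Intercepts are sampled separately (int_1, int_3), so drop their columns.
    x1_  = X1[, -1, drop = FALSE],
    x2_  = X2,                      # segment 2 has no intercept (~ 0 + x)
    x3_  = X3[, -1, drop = FALSE],
    MINX = min(df$x),
    MAXX = max(df$x)
  )
}
```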

Data structure

  • [x] Rename intercept: Intercept_i instead of int_i.
  • [x] Require par_x if there is not exactly one continuous predictor.
  • [x] Decide parameter naming: stay close to lm and brms, but add _segmentnumber.
  • [x] Split out RHS parameters from the segment table to a new long-format structure.
  • [x] Set appropriate priors. Distinguish between intercepts, dummies, and slopes?
  • [x] Raise error on rank deficiency using base::qr().
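The rank-deficiency check could be as simple as comparing the QR rank to the column count. The function name and error message below are illustrative:

```r
assert_full_rank <- function(X, segment) {
  if (qr(X)$rank < ncol(X)) {
    stop("The design matrix for segment ", segment, " is rank deficient. ",
         "Some predictors are perfectly collinear; remove or recode them.")
  }
  invisible(TRUE)
}

# A duplicated column makes the matrix rank deficient:
X_bad <- cbind(intercept = 1, x = 1:10, x_copy = 1:10)
# assert_full_rank(X_bad, segment = 1)  # raises the error above
```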

Modeling, sampling, and summaries

  • [x] Update get_formula() to match the new segment table.
  • [x] Update get_jagscode()
  • [x] Update run_jags() and get_jags_data() to work with design matrices. One per (dpar, segment) combination.
  • [x] Make sure that it applies for sigma(1 + x * group) etc. (I think it will out of the box).
  • [x] Verify that it works for plot_pars(), hypothesis(), summary(), fixef(), etc.
  • [ ] get_summary(): Translate between code parameter name and user-facing parameter name.

Simulated, fitted, and predicted values

  • [x] Change args to fit$simulate(fit, data, par1, par2, ...), i.e., add fit and replace par_x with data.frame/tibble.
  • [x] Check that the columns in data have the correct format. Set factor levels to match the original data.
  • [x] Make fit$simulate() a wrapper around a lower-level fast function to use internally. Call it fit$.internal$simulate_vec(par_x, cp_1, ..., rhs_par1, rhs_par2, ...). Only the former should do asserts, call add_simulated(), etc.
  • [x] Implement for ar()
  • [x] Implement for varying change points.
  • [x] Call simulate_vectorized() from all internal functions instead of fit$simulate().
  • [x] Ensure that it works for fitted(), predict(), pp_check(), etc.
  • [x] Fix existing mcp_examples
  • [x] Add new mcp_examples?
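The lower-level vectorized simulate could be the R analogue of the JAGS indicator-function sum earlier in the thread. A minimal sketch, assuming per-segment design matrices and coefficient vectors (all names illustrative):

```r
simulate_vec <- function(x, X1, X2, X3,
                         cp_1, cp_2,
                         int_1, beta_1, beta_2, int_3, beta_3) {
  as.vector(
    (x >= min(x)) * (int_1 + X1 %*% beta_1) +  # segment 1 always applies
    (x > cp_1)    * (X2 %*% beta_2) +          # segment 2 kicks in after cp_1
    (x > cp_2)    * (int_3 + X3 %*% beta_3)    # segment 3 kicks in after cp_2
  )
}
```

No asserts or data munging here; the user-facing fit$simulate() wrapper would do that once and then call this in a tight loop over posterior draws.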

Plot

  • [ ] Allow faceting by any unique combination of (categorical) variables using plot(fit, facet_by = c("my_rhs", "my_varying_cp")). Still default to no facets.
  • [ ] Control spaghetti groupings and colors using color_by = c("my_categorical1", "my_categorical2"). It defaults to color_by = "all_categorical", i.e., all unique combinations of categorical levels on the RHS. This also sets the grouping for spaghettis. I think that color_by should pertain solely to RHS variables, which share change points; varying change points will not be accepted.
  • [ ] Add a way to include a subset of the predictors: plot(fit, effects = "my_categorical1"). It's like brms::marginal_effects(). This should probably be implemented in tidy_samples().
  • [ ] Display only a subset of the levels using plot(fit, filter = data.frame(my_categorical1 = c("levelA", "levelB"), my_categorical2 = "level1")). This is like brms::marginal_effects(), except that filter is a data.frame rather than a named list like int_conditions. For variables in effects that are not in filter, all levels will be included. This should probably be implemented in tidy_samples().
  • [ ] (Add option to facet_by group levels in pp_check?)
  • [ ] Add pages to plot_pars()
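The filter semantics could reduce to a row-wise semi-join: keep only plot rows whose categorical columns match some row of the filter data.frame. A base-R sketch; the real tidy_samples() implementation may differ:

```r
filter_levels <- function(plot_data, filter) {
  keep <- rep(FALSE, nrow(plot_data))
  for (i in seq_len(nrow(filter))) {
    row_match <- rep(TRUE, nrow(plot_data))
    for (col in names(filter)) {
      row_match <- row_match & plot_data[[col]] == filter[[col]][i]
    }
    keep <- keep | row_match
  }
  plot_data[keep, , drop = FALSE]
}
```

Note that data.frame() recycles shorter columns, so filter = data.frame(my_categorical1 = c("levelA", "levelB"), my_categorical2 = "level1") has two rows and keeps the (levelA, level1) and (levelB, level1) combinations.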

Tests

  • [x] Pass existing test suite.
  • [ ] Write tests for combinations of main effects, factor-factor interactions, and factor-continuous interactions on ct, sigma, and ar. Also test the number of expected model parameters with and without intercepts.
  • [ ] Expect errors on multiple terms inside base functions, e.g., ~ exp(1 + x).
  • [ ] Test the new plot functions and tidy_draws()

lindeloev avatar Aug 10 '20 21:08 lindeloev

@lindeloev Out of curiosity, do you have a timeline on when this will be implemented?

sjmgarnier avatar Sep 14 '20 13:09 sjmgarnier

@sjmgarnier I got this to work locally, but while it provides many more modeling options (and is easier to maintain/extend), sampling of currently supported models takes double the time. mcp is already something like 100x slower than other packages; it aims to be useful for modeling flexibility rather than speed. So I'm in two minds about whether to continue down this path. What do you think?

lindeloev avatar Sep 14 '20 18:09 lindeloev

@lindeloev I'll answer very selfishly by saying that I need it for a project that I'm working on :-) I could not find a satisfying alternative for weighted segmented regressions with random effects. And I'm a patient man with a powerful computer, so I can wait a few minutes for the fit to finish. But I'm also very well aware of how time-consuming it is to develop and maintain packages, so I would completely understand if you decided to focus on other priorities.

sjmgarnier avatar Sep 14 '20 19:09 sjmgarnier

Cool, mcp is henceforth aimed at patient users with powerful computers :-) Yes, for data with < 200 data points, we're talking a slowdown from ~10 secs to ~20 secs so it's not the end of the world. I think this would be awesome so it's my top priority for the next release (mcp 0.4). Though I think that it will likely be at least a few months before it is out.

The first version with this will likely not have random effects on the RHS. But using a categorical intercept (like group in the OP) should be reasonably close in many scenarios since shrinkage is often minor.

lindeloev avatar Sep 14 '20 19:09 lindeloev

I'm working on this now and have almost finished making all design decisions. Luckily, I found an implementation that won't negatively affect performance. Expect release in a few months, depending on how hard it is to make sensible plot() etc.

lindeloev avatar Dec 15 '20 09:12 lindeloev

Is this implemented in a development version? A reviewer demands a multivariate analysis :/

mattmoo avatar Jan 17 '21 05:01 mattmoo

@mattmoo I just pushed the development version to the v0.4 branch here. I think you can install it using remotes::install_github("lindeloev/mcp@v0.4"). You can fit models and see parameter summaries, including plot_pars(fit). It works for e.g. y ~ x + sigma(1 + x:group) too.

But fit$simulate(), plot(), fitted()/predict(), and pp_check() won't work yet, though I'm making good progress! So you may want to shift back and forth between versions if you need some of the old functionality for one-predictor-only models.

I've only tested it on gaussian models so far, so please triple-check. And any feedback and ideas would be much appreciated, BTW!

lindeloev avatar Jan 17 '21 09:01 lindeloev

There is good progress on this. I just pushed the latest version to the v0.4 branch. Install using remotes::install_github("lindeloev/mcp@v0.4"). The easiest way to see it in action is running result = mcp_example("multiple").

I'm basically just missing plot() support for varying effects and categorical predictors.

lindeloev avatar Feb 20 '21 22:02 lindeloev

That's great! I'm waiting on compute resources to actually run the analysis (50,000 datapoints :/)

mattmoo avatar Feb 20 '21 22:02 mattmoo

I just tested the performance locally on a Ryzen 5 3600. For 55,000 data points and a 15-parameter model with categorical predictors, I get around 1 sample per second. So if you run in parallel with default settings (3000 warmup iters + 1500 sampling iters), you might complete sampling in a matter of hours?

library(magrittr)  # for the %>% pipe
ex = mcp_example("multiple")
df_tmp = tidyr::expand_grid(rep = 1:250, ex$data) %>%  # upscale to ~55,000 data points
  dplyr::select(-rep)
fit = mcp(ex$model, df_tmp, par_x = "x", chains = 1, adapt = 100, iter = 100)

I just pushed a new commit to v0.4 which requires ~10% of the memory of the previous version during sampling. Maybe that helped too.

lindeloev avatar Feb 21 '21 22:02 lindeloev

I'll give it a go. From my previous experience with these data (see link), the sampling does not converge so quickly (and I don't have such a nice processor!).

mattmoo avatar Feb 21 '21 22:02 mattmoo

Hi, is this issue still up to date or is there additional information for how to incorporate additional predictors?

adrose avatar Aug 03 '23 19:08 adrose

@adrose Unfortunately not. The v0.4 branch currently doesn't run out-of-the-box due to some backwards-incompatible changes in dependency packages that I have only incorporated into the v0.3 series. I'm presently prioritizing another project, but I'm really looking forward to getting v0.4 out since it's awesome!

lindeloev avatar Aug 03 '23 20:08 lindeloev

Love the package. Looking forward to v0.4.

gavanmcgrath avatar Aug 09 '23 02:08 gavanmcgrath