Conditional logit (clogit) models
Currently, projpred does not work with models estimated via stan_clogit; e.g.
example(stan_clogit, package = "rstanarm") # creates post
projpred::get_refmodel(post) # fails
But I want to be clear about what would need to happen for clogit models to become supported by the projpred package, so that @jgabry and I can help. At help(clogit, package = "survival"), it is pointed out that the likelihood for a clogit model is a special case of that for a Cox model, if that makes things easier. The best documentation of a clogit model is actually in Stata. I would guess that the form of the likelihood is sufficiently nice that projpred should eventually work, but two things differ from what projpred is accustomed to: the model (and its submodels) lacks an intercept, because one is not identified under the likelihood alone, and the prediction task is essentially leave-one-stratum-out rather than leaving one individual out of some stratum.
I have no experience with stan_clogit. Perhaps @aloctavodia has? (He is the new maintainer)
@avehtari knows the model. In any event, all the post-estimation functions in rstanarm work with stan_clogit output, so I think it is mostly a matter of synchronizing with what projpred is expecting.
That Stata PDF does not have equations. Quite clear equations are given on Wikipedia: https://en.wikipedia.org/wiki/Conditional_logistic_regression.
The rstanarm documentation at https://mc-stan.org/rstanarm/reference/stan_clogit.html shows the data and model formula:
dat <- infert[order(infert$stratum), ] # order by strata
post <- stan_clogit(case ~ spontaneous + induced + (1 | education),
strata = stratum,
data = dat,
subset = parity <= 2,
QR = TRUE,
chains = 2, iter = 500)
and the head of the data looks like
education age parity induced case spontaneous stratum pooled.stratum
1 0-5yrs 26 6 1 1 2 1 3
84 0-5yrs 26 6 2 0 0 1 3
166 0-5yrs 26 6 2 0 0 1 3
2 0-5yrs 42 1 1 1 0 2 1
85 0-5yrs 42 1 0 0 0 2 1
167 0-5yrs 42 1 0 0 0 2 1
The crucial part here is that one factorizing likelihood term depends on one stratum, which has at least two rows in the data frame (in this case three rows) and these rows may have different covariate values (like here induced and spontaneous may be different for different rows in the same stratum).
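To make that structure concrete, here is a small illustrative Python sketch (not part of projpred or rstanarm; the function name `stratum_likelihood` is made up, and it uses brute-force enumeration, so it is only suitable for tiny strata) of one stratum's likelihood contribution, where the denominator sums over all outcome vectors with the observed number of successes:

```python
import itertools
import math

def stratum_likelihood(y, eta):
    """Likelihood contribution of one stratum in a conditional logit model:
    exp(y . eta) / Z, where Z sums exp(y' . eta) over all binary vectors y'
    with the same number of successes as y (brute force; tiny strata only)."""
    k = sum(y)
    num = math.exp(sum(yi * ei for yi, ei in zip(y, eta)))
    Z = sum(
        math.exp(sum(eta[i] for i in idx))
        for idx in itertools.combinations(range(len(eta)), k)
    )
    return num / Z

# Three rows in the stratum (as in the infert example), exactly one success:
p = stratum_likelihood([1, 0, 0], [0.5, -0.2, 0.1])
```

With exactly one success per stratum (as here), the denominator reduces to `sum(exp(eta))`, so the contribution is just a softmax over the stratum's linear predictors.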
@fweber144, maybe augmented-data projection could be used for this, too? Or is this too different?
A chatbot seems to think that project could be made to support clogit models by calling posterior_predict(stan_clogit_output) to get 4000 or so outcome vectors and then calling survival::clogit() on each of those outcome vectors using the subset of predictors passed to the predictor_terms argument of project. I would not be surprised if it turns out to be more complicated than that, but how so?
That is one way. Using posterior_predict() includes additional noise. For example, with a normal model, we can do the projection using posterior_epred() (and similarly, for exponential-family GLMs, we can use the exponential family's natural parameters). It would be possible to reduce noise by using the augmented-data projection approach described in https://doi.org/10.1007/s00180-024-01506-0. The augmented dataset has, for each of the N original observations, as many rows as there are distinct outcome values (I guess in this case all different combinations for each stratum), and when doing the fitting, the different outcome values are weighted by their predictive probabilities. See Section 2.3 and eq. (2) in that paper. As survival::clogit() supports weights, this should be easy (unless the augmented dataset gets too big).
Augmented data projection is an interesting possibility for clogit models because the denominator of the likelihood function already requires evaluating the probability of observing all possible configurations of the outcome vector in a stratum that have a given sum. That can be a computational nightmare, so we have to use dynamic programming in stan_clogit. As the documentation of survival::clogit says:
If a particular strata had say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which partly ameliorates this ...
So, I think literally making the augmented dataset would exhaust the RAM in many cases. But we could just use optim to find the $\boldsymbol{\theta}_c$ that maximizes the weighted log-likelihood, where the weights come from rows of posterior_epred(reference_model) and the log-likelihood utilizes Gail et al.'s recursive formula (in the common case of exactly one success per stratum, it simplifies to log_sum_exp(eta_stratum)).
Or at least that would work if there are no splines or terms like (-1 + x | strata) in the submodel. In those situations, the chatbot seems to think that doing what projpred already does (modeling the linear predictor of the reference model) would be fine, except that:
- Each fixed effect in the submodel needs to be centered within-strata (but the random effects do not)
- The linear predictions from the reference model need to be centered within-strata
- There is no global intercept, but a null model in a clogit context puts equal success probability within each stratum
Using the Gail et al. method seems sensible. I have low trust in any LLM advice on projpred, as there is not much material on the internet for them to learn from. But there might be enough material on maximum likelihood for clogit, so asking the questions in that context may get useful answers. Unfortunately, I don't have enough intuition about clogit to assess whether the three exceptions you list make sense.
Let me see if I can summarize what the chatbot is telling me. In a clogit model, the strata are conditionally independent, but the outcomes within a stratum are not conditionally independent because of the constraint on their sum. The likelihood contribution from one stratum is given on Wikipedia as
$$\Pr\left(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}\right) = \frac{\exp\left(\sum_{i : y_i = 1} \mathbf{x}_i^\top \boldsymbol{\beta}\right)}{\sum\limits_{\widetilde{\mathbf{y}} \in \mathcal{Y}} \exp\left(\sum_{i : \widetilde{y}_i = 1} \mathbf{x}_i^\top \boldsymbol{\beta}\right)},$$
where $\mathcal{Y}$ is the set of binary vectors with the same sum as $\mathbf{y}$.
We could rewrite that stratum's contribution to the likelihood as
$$\mathcal{L}\left(\boldsymbol{\beta}; \mathbf{y}\right) = \frac{e^{\mathbf{y}^\top \boldsymbol{\eta}\left(\boldsymbol{\beta}\right)}}{Z\left(\boldsymbol{\beta}\right)},$$
where $\mathbf{y}$ is a binary vector whose size is equal to the number of individuals in this stratum, $\boldsymbol{\eta}\left(\boldsymbol{\beta}\right)$ is a vector of linear predictors, and the normalizing factor $Z\left(\boldsymbol{\beta}\right)$ does not depend on the particular $\mathbf{y}$ because it sums over all possible outcome vectors that have the given sum.
As emphasized in the documentation of survival::clogit, $Z\left(\boldsymbol{\beta}\right)$ has a huge number of terms unless the fixed number of successes in the stratum is $1$ (which is common in public health but less so in economics).
If a particular strata had say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which partly ameliorates this ...
The dynamic programming approach of Gail et al. (which was actually first derived by Howard) is not so hard to write down. Here is a version (not numerically stable) in Stan code:
real Z(int successes, vector e_eta) { // e_eta is exp(eta), where eta is the linear predictor vector
  int obs = rows(e_eta);
  if (successes == 0) return 1;
  if (successes == obs) return prod(e_eta);
  if (successes == 1) return sum(e_eta);
  int dp1 = obs - successes + 1;
  vector[dp1] B = rep_vector(1, dp1);
  for (i in 1:(successes - 1)) B = cumulative_sum(B .* segment(e_eta, i, dp1));
  return dot_product(B, e_eta[successes:obs]);
}
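For reference, here is an illustrative Python port of that recursion (same dynamic-programming algorithm, same lack of numerical stability; function names are made up), together with a brute-force version that it can be checked against:

```python
import itertools
import math

def Z_recursive(successes, e_eta):
    """Dynamic-programming evaluation of the clogit normalizing factor:
    the sum of products of `successes` distinct elements of e_eta = exp(eta)
    (an elementary symmetric polynomial). Mirrors the Stan code above."""
    obs = len(e_eta)
    if successes == 0:
        return 1.0
    if successes == obs:
        return math.prod(e_eta)
    if successes == 1:
        return sum(e_eta)
    dp1 = obs - successes + 1
    B = [1.0] * dp1
    for i in range(successes - 1):  # Stan's 1:(successes - 1), zero-based here
        acc, new_B = 0.0, []
        for j in range(dp1):        # cumulative_sum(B .* segment(e_eta, i, dp1))
            acc += B[j] * e_eta[i + j]
            new_B.append(acc)
        B = new_B
    return sum(b * e for b, e in zip(B, e_eta[successes - 1:]))

def Z_brute(successes, e_eta):
    """Reference implementation: enumerate all subsets of the given size."""
    return sum(math.prod(c) for c in itertools.combinations(e_eta, successes))
```

The recursion needs only O(successes * obs) operations, while the brute-force sum has $\binom{\text{obs}}{\text{successes}}$ terms.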
The outcome vector within a stratum is binary and has a finite size, so augmented-data projection is technically applicable. If we just think of a cluster as holding one draw, $\boldsymbol{\beta}^\ast$, from the reference model's posterior distribution, the optimization problem in equation (2) from the paper for a clogit submodel with $p$ predictors and $S$ strata could be written like
$$\boldsymbol{\beta} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\max} \sum\limits_{s = 1}^S \sum\limits_{\widetilde{\mathbf{y}} \in \mathcal{Y}_s} \Pr\left(\widetilde{\mathbf{y}} \mid \boldsymbol{\beta}^\ast\right) \log L\left(\boldsymbol{\beta}; \widetilde{\mathbf{y}}\right).$$
However, the inner sum could have a huge number of terms, so creating the augmented dataset would often exhaust the RAM. So, we have to simplify it analytically. Since $\log L\left(\boldsymbol{\beta}; \widetilde{\mathbf{y}}\right) = \widetilde{\mathbf{y}}^\top \boldsymbol{\eta}\left(\boldsymbol{\beta}\right) - \log Z\left(\boldsymbol{\beta}\right)$ and the second term does not depend on $\widetilde{\mathbf{y}}$, the inner sum can be split and expressed as
$$- \log Z\left(\boldsymbol{\beta}\right) + \sum\limits_{\widetilde{\mathbf{y}} \in \mathcal{Y}_s} \Pr\left(\widetilde{\mathbf{y}} \mid \boldsymbol{\beta}^\ast\right) \widetilde{\mathbf{y}}^\top \boldsymbol{\eta}\left(\boldsymbol{\beta}\right)$$
because
$$\sum\limits_{\widetilde{\mathbf{y}} \in \mathcal{Y}_s} \Pr\left(\widetilde{\mathbf{y}} \mid \boldsymbol{\beta}^\ast\right)= 1.$$
Now, $\widetilde{\mathbf{y}}^\top \boldsymbol{\eta}\left(\boldsymbol{\beta}\right)$ is just the sum of the linear predictors for the observations whose future outcome is $1$ rather than $0$. So, we have to do the double summation cleverly. Imagine a sparse matrix $\mathbf{M}_s\left(\boldsymbol{\beta}\right)$ where
- The rows enumerate all the ways that $\widetilde{\mathbf{y}}$ can have $k_s$ successes
- The $J_s$ columns enumerate the $J_s$ observations in the $s$-th stratum
- The non-zero elements are linear predictors when the intersection of the row and column indices implies that observation is a success. Thus, the non-zero elements in each column are all the same.
In principle, we could sum each row of $\mathbf{M}_s\left(\boldsymbol{\beta}\right)$, weight that by the corresponding $\Pr\left(\widetilde{\mathbf{y}} \mid \boldsymbol{\beta}^\ast\right)$, and sum over configurations, but that would be computationally inefficient because the non-zero elements in each row are not all the same. Alternatively, we could take a weighted sum of each column of $\mathbf{M}_s\left(\boldsymbol{\beta}\right)$ --- where the weights are the corresponding probabilities under the reference model --- and then sum over observations in the stratum. Since the non-zero elements in the $j$-th column of $\mathbf{M}_s\left(\boldsymbol{\beta}\right)$ are the same, say $\eta_j = \mathbf{x}_j^\top \boldsymbol{\beta}$, a weighted sum of the $j$-th column of $\mathbf{M}_s\left(\boldsymbol{\beta}\right)$ where the weights are the reference model probabilities is just $\eta_j$ times the sum of the reference model probabilities that correspond to successes for the $j$-th observation in the stratum. But
$$\sum\limits_{\substack{\widetilde{\mathbf{y}} \in \mathcal{Y}_s \\ \widetilde{y}_j = 1}} \Pr\left(\widetilde{\mathbf{y}} \mid \boldsymbol{\beta}^\ast\right)$$
is the probability that the $j$-th observation in the stratum is a future success given $\boldsymbol{\beta}^\ast$, which is $\mathbb{E}\left[\widetilde{y}_j \mid \boldsymbol{\beta}^\ast\right]$ and that is given by the elements of E <- posterior_epred(reference_model). E can be pre-computed, so
$$\underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\max} \sum\limits_{s = 1}^S \sum\limits_{\widetilde{\mathbf{y}} \in \mathcal{Y}_s} \Pr\left(\widetilde{\mathbf{y}} \mid \boldsymbol{\beta}^\ast\right) \log L\left(\boldsymbol{\beta}; \widetilde{\mathbf{y}}\right)$$
becomes
$$\underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\max} \sum\limits_{s = 1}^S \left[-\log Z_s\left(\boldsymbol{\beta}\right) + \sum\limits_{j = 1}^{J_s} E_{c,\, j + \sum_{k = 1}^{s - 1} J_k} \, \eta_j\left(\boldsymbol{\beta}\right)\right],$$
where $c$ indexes the posterior draw (cluster).
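As a numerical sanity check of that objective, here is an illustrative Python sketch (made-up toy data; a crude grid search stands in for optim, and $Z_s$ is computed by brute force since the strata are tiny):

```python
import itertools
import math

def neg_objective(beta, X_strata, E_strata, k_strata):
    """Negative weighted clogit log-likelihood for one posterior draw:
    for each stratum s, sum_j E[y_j | beta*] * eta_j(beta) - log Z_s(beta).
    X_strata: list of J_s x p design matrices (lists of rows);
    E_strata: list of length-J_s weight vectors (one row of posterior_epred);
    k_strata: number of successes per stratum."""
    total = 0.0
    for X, E, k in zip(X_strata, E_strata, k_strata):
        eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
        total += sum(e, ) if False else sum(e * h for e, h in zip(E, eta))
        # brute-force log Z_s (fine for tiny strata; use the recursion otherwise)
        Z = sum(math.exp(sum(eta[i] for i in idx))
                for idx in itertools.combinations(range(len(eta)), k))
        total -= math.log(Z)
    return -total

# Toy example: two strata, one predictor, exactly one success per stratum.
X_strata = [[[1.0], [0.0]], [[0.5], [2.0]]]
E_strata = [[0.7, 0.3], [0.2, 0.8]]
k_strata = [1, 1]

# Crude 1-D grid search standing in for optim().
grid = [i / 100 for i in range(-300, 301)]
best = min(grid, key=lambda b: neg_objective([b], X_strata, E_strata, k_strata))
```

Because log-sum-exp is convex and the weighted term is linear in $\boldsymbol{\beta}$, the negated objective is convex, so any off-the-shelf optimizer should behave well here.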
Does that make sense?
> @fweber144, maybe augmented-data projection could be used for this, too? Or is this too different?
Yes, in theory, augmented-data projection could be used for this. However, with regards to the implementation of the augmented-data projection in projpred, some changes would have to be made. The reason is that until now, we construct the augmented dataset in projpred by "row-binding" the original dataset $C$ times, with $C$ denoting the number of response categories. So the augmented dataset currently has $N \cdot C$ rows, with $N$ denoting the number of observations. For conditional logit regression, the augmented dataset would need to have
$$\sum\limits_{s = 1}^{S} \binom{J_s}{k_s} J_s$$
rows (using the notation from above). In the case of 1:1 matching, the augmented dataset for conditional logit regression should have the same number of rows as the augmented dataset for a "conventional" logit regression, namely $N \cdot 2$, because $S = \frac{N}{2}$ (guaranteed to be an integer by construction in this case), $m = 2$, $k_s = 1$ (for all strata), and $J_s = 2$ (for all strata). The (artificial) outcome values in these two kinds of augmented datasets should also be the same (namely, one row with outcome 0 and one row with outcome 1 for each observation). So the current implementation might work for a conditional logit regression with 1:1 matching. However, even for 1:1 matching, I'm not sure whether the data column included in the formula's survival::strata() part (= argument strata of rstanarm::stan_clogit()) would need some special treatment within projpred.

For a conditional logit regression that does not use 1:1 matching, the augmented dataset can have a huge number of rows (as both of you have pointed out), and even for a feasible number of rows, the augmented dataset would have to be constructed in a different way. The current approach described above (yielding $N \cdot C$ rows; see lines https://github.com/stan-dev/projpred/blob/581d8a996841e8f86a928b8bdf2782a592c986bb/R/formula.R#L571-L574) would not work anymore in those cases.
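Assuming the row count is $\sum_{s} \binom{J_s}{k_s} J_s$ (each of the $\binom{J_s}{k_s}$ admissible outcome configurations of stratum $s$ contributing $J_s$ rows), here is an illustrative Python check (function name made up) that 1:1 matching indeed reproduces $N \cdot 2$ rows, while the 10-of-20 case from the survival::clogit documentation explodes:

```python
from math import comb

def augmented_rows(J, k):
    """Row count of the hypothetical clogit augmented dataset: each of the
    C(J_s, k_s) admissible outcome configurations of stratum s contributes
    J_s rows. J and k are per-stratum lists."""
    return sum(comb(Js, ks) * Js for Js, ks in zip(J, k))

# 1:1 matching with N = 10 observations: S = 5 strata, J_s = 2, k_s = 1,
# giving N * 2 = 20 rows.
rows_matched = augmented_rows([2] * 5, [1] * 5)

# One stratum with 10 events out of 20 subjects: 184756 configurations,
# each contributing 20 rows.
rows_big = augmented_rows([20], [10])
```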
> Does that make sense?
Yes, I think this makes sense. However, this means that you would have to use a custom div_minimizer function (div_minimizer is an argument of init_refmodel(), so you would have to use a custom reference model object as well) and I cannot guarantee that a custom div_minimizer function would suffice. The special data format might require some more changes.
EDIT: GitHub did not render the display-mode math correctly, so I replaced it with an image.