
Subsampled LOO

Open fweber144 opened this issue 3 years ago • 16 comments

In cv_varsel.refmodel(), the observation-specific weights are hard-coded to be all equal (line https://github.com/stan-dev/projpred/blob/0cabd73548eec32e9b675f15cd39d377a40f180a/R/cv_varsel.R#L221 ), but in case of subsampling LOO-CV, they might be different (as the comment says there, too). Thus, shouldn't that line be modified?

fweber144 avatar Mar 19 '21 11:03 fweber144

Yes. This is a bit less straightforward to fix because we can compute either approximate LOO or full LOO, so I think the right weights should be returned by loo_varsel and kfold_varsel themselves. I'll work on this after resolving the easier issues.

AlejandroCatalina avatar Mar 22 '21 12:03 AlejandroCatalina

Related comments/discussions may be found in issue #173 (enumeration point 4) and PR #188. Now that I understand the purpose of w better, I think constant weights for LOO-CV are ok (no matter whether subsampled or not) but they might be inappropriate for K-fold CV with folds of very different sizes.

fweber144 avatar Aug 11 '21 07:08 fweber144

I have now read Magnusson et al. (2019) and Magnusson et al. (2020), but I'm still confused about how projpred wants to perform subsampled LOO CV. More importantly, I have the strong feeling that the current implementation is not correct. What I learned from Magnusson et al. (2019) and Magnusson et al. (2020) makes me think the problems with the implementation in projpred go deeper than I initially thought.

What is important for projpred is that:

  1. Magnusson et al. (2019) propose to perform probability-proportional-to-size sampling (PPS) combined with the Hansen-Hurwitz (HH) estimator.
  2. Magnusson et al. (2020) propose to perform simple random sampling (SRS) combined with the difference estimator (both estimators are sketched below). The reason for this new proposal is that the approach by Magnusson et al. (2019) is not optimal for model comparisons (as explained by Magnusson et al., 2020).
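
For reference, here is my hedged sketch of the two key estimators (the notation is mine and may deviate from the papers'). With $N$ observations, subsample size $m$, exact pointwise elpds $\pi_i$, and approximate (surrogate) pointwise elpds $\tilde{\pi}_i$, the PPS-with-replacement HH estimator (Magnusson et al., 2019, equation (4)) uses selection probabilities $p_i \propto |\tilde{\pi}_i|$ and reads

$$\widehat{\mathrm{elpd}}_{\mathrm{HH}} = \frac{1}{m} \sum_{j=1}^{m} \frac{\pi_j}{p_j},$$

whereas the SRS-without-replacement difference estimator (Magnusson et al., 2020, equation (7)), with subsample $S$, $|S| = m$, reads

$$\widehat{\mathrm{elpd}}_{\mathrm{diff}} = \sum_{i=1}^{N} \tilde{\pi}_i + \frac{N}{m} \sum_{j \in S} \left( \pi_j - \tilde{\pi}_j \right).$$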

However, in my opinion, neither the approach by Magnusson et al. (2019) nor that by Magnusson et al. (2020) is currently used in projpred. The key equation from Magnusson et al. (2019) is equation (4); the key equation from Magnusson et al. (2020) is equation (7). However, I can't find either of these two equations in projpred. The appropriate place would probably be lines https://github.com/stan-dev/projpred/blob/dd4bc36aba24ea680050f680d4c1cf643a3680fc/R/summary_funs.R#L171-L179 (these lines only refer to the ELPD as performance statistic; projpred has more performance statistics, of course), but they correspond neither to equation (4) from Magnusson et al. (2019) nor to equation (7) from Magnusson et al. (2020). The SE estimators from these lines also don't correspond to the SE estimators from Magnusson et al. (2019, equation (6)) and Magnusson et al. (2020, equation (9)). The subsampling itself (performed in lines https://github.com/stan-dev/projpred/blob/dd4bc36aba24ea680050f680d4c1cf643a3680fc/R/cv_varsel.R#L839-L840 ) also confuses me: for example, for the approach by Magnusson et al. (2019), I would have expected the logarithms of the predictive densities (not the predictive densities themselves) in the definition of w, and then sampling with replacement.
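
To illustrate the last point, a minimal R sketch (lppd stands for the reference model's pointwise log predictive densities; the names are mine, not projpred's):

lppd <- rnorm(100, mean = -1)  # stand-in values, purely for illustration
nloo <- 20
# What the current code does (weights proportional to the predictive
# densities, not to their logarithms):
wcv_current <- exp(lppd - max(lppd))
# What I would have expected for PPS: selection probabilities proportional
# to the *log* predictive densities (in absolute value), and then sampling
# *with* replacement:
wcv_expected <- abs(lppd) / sum(abs(lppd))
inds <- sample(seq_along(lppd), nloo, replace = TRUE, prob = wcv_expected)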

So if the current implementation of subsampled LOO CV in projpred is not correct, then I think projpred should not support subsampled LOO CV until this is fixed.

Of course, I could be totally wrong, and neither Magnusson et al. (2019) nor Magnusson et al. (2020) is relevant for subsampled LOO CV in projpred. But then I would be even more confused, because this comment: https://github.com/stan-dev/projpred/blob/dd4bc36aba24ea680050f680d4c1cf643a3680fc/R/cv_varsel.R#L821-L825 indicates that projpred wants to use the approach by Magnusson et al. (2019).

If my assumption is correct and projpred needs a fix, then a way to implement either the approach by Magnusson et al. (2019) or that by Magnusson et al. (2020) might be to rely on loo::loo_subsample() (see this vignette). This is just a rough idea, though, and I don't know if it is really feasible.
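
To make this rough idea slightly more concrete, here is a hedged sketch of how loo::loo_subsample() could be called for a Gaussian GLM (following the loo vignette; llfun_gauss, dat, and post_draws are illustrative names, not existing projpred objects):

library(loo)

# Pointwise log-likelihood function as required by loo_subsample(): data_i
# is a one-row data frame, draws is an S x (D + 1) matrix of posterior
# draws (regression coefficients plus, in the last column, the residual
# standard deviation; intercept omitted for brevity).
llfun_gauss <- function(data_i, draws) {
  x_i <- as.matrix(data_i[, grepl("^X", names(data_i)), drop = FALSE])
  mu_i <- as.vector(draws[, seq_len(ncol(x_i)), drop = FALSE] %*% t(x_i))
  dnorm(data_i$y, mean = mu_i, sd = draws[, ncol(draws)], log = TRUE)
}

loo_ss <- loo_subsample(
  x = llfun_gauss,
  data = dat,                 # data frame with y and X1, X2, ... (assumed)
  draws = post_draws,         # posterior draws matrix (assumed available)
  observations = 100,         # subsample size
  estimator = "diff_srs",     # SRS combined with the difference estimator
  loo_approximation = "plpd"  # cheap surrogate for the pointwise elpds
)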

Of course, given these new findings, my statement

[...] I think constant weights for LOO-CV are ok (no matter whether subsampled or not) [...]

from the comment above (see also enumeration point 4 in #173) is not correct. At that time, I hadn't yet realized that projpred (probably) wants to use the approach by Magnusson et al. (2019), which requires sampling with replacement. This in turn means that each observation can occur more than once in the subsample (in contrast to my statement

[...] after all, they all represent a single observation [...]

from #188), so that when aggregating across the subsampled LOO CV folds, non-constant weights make sense (but note that these would have to be inverse-probability weights, in contrast to the non-inverted probability weights currently used).
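
For illustration, a minimal R sketch of PPS sampling with replacement combined with Hansen-Hurwitz (i.e., inverse-probability) weighting; all names and values are mine, purely for illustration:

elpd_approx <- rnorm(100, mean = -1)           # surrogate pointwise elpds
p <- abs(elpd_approx) / sum(abs(elpd_approx))  # PPS selection probabilities
m <- 20
# PPS sampling *with* replacement, so an index can occur more than once:
inds <- sample(seq_along(p), m, replace = TRUE, prob = p)
elpd_exact_sub <- elpd_approx[inds]            # stand-in for the exact values
# Hansen-Hurwitz: the average of the inverse-probability-weighted values
# estimates the total (the sum over all 100 observations):
elpd_hh <- mean(elpd_exact_sub / p[inds])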

References

Magnusson, M., Andersen, M., Jonasson, J., and Vehtari, A. (2019). Bayesian leave-one-out cross-validation for large data. In Proceedings of the 36th International Conference on Machine Learning, 4244-4253. URL: https://proceedings.mlr.press/v97/magnusson19a.html.

Magnusson, M., Vehtari, A., Jonasson, J., and Andersen, M. (2020). Leave-one-out cross-validation for Bayesian model comparison in large data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 341-351. URL: https://proceedings.mlr.press/v108/magnusson20a.html.

fweber144 avatar Oct 11 '21 19:10 fweber144

An alternative to dropping support for subsampled LOO CV entirely might be to throw a warning about the potentially incorrect implementation.

fweber144 avatar Oct 12 '21 14:10 fweber144

See also

I also noticed that <vsel_object>$summaries$ref$w (so w for the reference model) is always NULL, which is strange because it looks like lines https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/R/summary_funs.R#L76-L78 expect a non-NULL element w for the reference model as well. That would also make sense if w is supposed to contain the weights of the CV folds the observations are in, because the reference model's performance is also cross-validated whenever the submodels' performances are.

in this comment. When fixing subsampled LOO CV, I guess that also needs to be fixed.

fweber144 avatar Jul 26 '22 09:07 fweber144

More issues/questions concerning subsampled LOO CV:

  1. NAs in get_stat()'s baseline objects mu.bs and lppd.bs are not handled consistently across the different stats (I think the handling from RMSE and AUC needs to be used for all other stats as well). Furthermore, I think n_notna needs to be updated according to the NAs in the baseline objects mu.bs and lppd.bs if baseline = "best". I have now (PR #350) also added n_notna.bs, which could perhaps be re-used.
  2. For the AUC, the modified mu doesn't even seem to be used (see lines https://github.com/stan-dev/projpred/blob/e7080db29e5d77c628b067a97a6da80b72a69156/R/summary_funs.R#L295-L307).
  3. It seems like loo_varsel()$d_test always contains all observations as test points, but line https://github.com/stan-dev/projpred/blob/e7080db29e5d77c628b067a97a6da80b72a69156/R/summary_funs.R#L103 doesn't reduce this down to the subsampled observations. Is this on purpose?

fweber144 avatar Sep 14 '22 12:09 fweber144

The comment from https://github.com/stan-dev/projpred/issues/94#issuecomment-1195207164 has been addressed by commit 2bd6b4efdd6a2f327d73c563da6f6fe451f4470d (within PR #475).

Question 1 from https://github.com/stan-dev/projpred/issues/94#issuecomment-1246694046 has now been addressed by commit 7232148561c13475d13392a7ae841ceb39d2b80c (within PR #475).

Question 2 from https://github.com/stan-dev/projpred/issues/94#issuecomment-1246694046 has now been addressed by commit d109c371de960ddf1e2567eb9e4c0fc5e68339d9 (within PR #475).

Question 3 from https://github.com/stan-dev/projpred/issues/94#issuecomment-1246694046 doesn't need to be addressed (i.e., it can be ignored) because that behavior was probably indeed on purpose (non-subsampled observations have NA as predicted value and so they will be ignored by get_stat()).

fweber144 avatar Nov 09 '23 20:11 fweber144

For the future: In the code, I have added comments starting with TODO (subsampled PSIS-LOO CV) which need to be addressed. However, addressing these comments does not guarantee a correctly behaving subsampled PSIS-LOO CV (in particular, there are probably deeper methodological issues that need to be addressed as well, see above).

fweber144 avatar Nov 09 '23 20:11 fweber144

And a question to @AlejandroCatalina: Is it on purpose that the bootstrap samples are drawn from the set of all observations (i.e., with the NA observations included)? Couldn't it also make sense to draw the bootstrap samples from only the subsampled observations (i.e., the non-NA observations)?

fweber144 avatar Nov 09 '23 21:11 fweber144

Yeah it would be okay too I think

AlejandroCatalinaF avatar Nov 10 '23 02:11 AlejandroCatalinaF

@avehtari: Still to do (or to check) here:

  • https://github.com/stan-dev/projpred/issues/94#issuecomment-940373365
  • https://github.com/stan-dev/projpred/issues/94#issuecomment-1804676744
  • Draw the bootstrap samples from only the subsampled observations (https://github.com/stan-dev/projpred/issues/94#issuecomment-1804681736)?

fweber144 avatar Dec 19 '23 14:12 fweber144

(Sorry for not checking this 2 years ago!)

  • loo_ref_oscale <- apply(loglik_forPSIS + lw, 2, log_sum_exp) at https://github.com/stan-dev/projpred/blob/15dac2b0dfc9b8cdd694171d0fb4ea44548784d2/R/cv_varsel.R#L665 is equivalent to the pointwise elpd_loo values

  • The loo_subsample_pps function has wcv <- exp(lppd - max(lppd)), which is wrong; it should be either abs(lppd) or -lppd + max(lppd) + 1.

  • loo_subsample_pps returns the weights, but they are not used elsewhere in the code, so the current code is not actually doing PPS.

  • After re-reading both the PPS and the difference estimator papers, I think we should use the difference estimator.

  • The main purpose of subsampled LOO in projpred is the validate_search=TRUE case, where it reduces the number of times the search is repeated. If we assume that full LOO has been used for the full-data search path, we can use the full-data LOO result for model size k as an approximation for all models of size k. When estimating the LOO performance in validate_search for a model size k, the first term in eq (7) is then the full-data size-k model elpd (using the reference model's PSIS-LOO weights), and in the second term, \pi_j is the sum of pointwise elpds from the different search paths for model size k. We get the elpd difference to the reference model correspondingly. The benefit of the difference estimator with SRS is that we get the diff_se via eq (9) (see the sketch after this list).

  • The difference estimator with SRS uses sampling without replacement, and there are no weights.
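
Here is a minimal R sketch of how the diff_se via eq (9) could be computed under SRS without replacement. This is my reading of the standard SRS variance of the difference estimator (with finite-population correction); the paper's eq (9) may contain additional terms, and all names are illustrative:

# pi_sub: exact pointwise elpds for the m subsampled observations;
# pi_tilde: surrogate pointwise elpds for all N observations;
# sub_inds: indices of the subsampled observations.
diff_srs_se <- function(pi_sub, pi_tilde, sub_inds) {
  N <- length(pi_tilde)
  m <- length(sub_inds)
  e <- pi_sub - pi_tilde[sub_inds]      # pointwise differences
  v <- N^2 * (1 - m / N) * var(e) / m   # SRS-WOR variance with FPC
  sqrt(v)
}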

avehtari avatar Dec 27 '23 13:12 avehtari

Here is a demonstration of subsampled LOO with simple random sampling and the difference estimator:

library(rstanarm)
library(projpred)
library(tidyr)
library(ggplot2)

data("df_gaussian")
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

# Prior guess for the number of relevant (i.e., non-zero) regression
# coefficients:
p0 <- 5
# Number of observations:
N <- nrow(dat_gauss)
# Number of candidate predictors:
D <- sum(grepl("^X", names(dat_gauss)))
# Hyperprior scale for tau, the global shrinkage parameter (note that for the
# Gaussian family, 'rstanarm' will automatically scale this by the residual
# standard deviation):
tau0 <- p0 / (D - p0) * 1 / sqrt(N)

set.seed(5078022)
refm_fit <- stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X14 +
    X15 + X16 + X17 + X18 + X19 + X20,
  family = gaussian(),
  data = dat_gauss,
  prior = hs(global_scale = tau0),
  ### Only for the sake of speed (not recommended in general):
  chains = 2, iter = 1000,
  ###
  refresh = 0
)


refm_obj <- get_refmodel(refm_fit)

# Fast with validate_search = FALSE
cvvs_fast <- cv_varsel(
  refm_obj,
  validate_search = FALSE,
  nterms_max = 20
)

plot(cvvs_fast, alpha=0.1, deltas=TRUE,
     text_angle=45, size_position = 'primary_x_top') +
  geom_vline(xintercept=seq(0,20,by=5), colour='black', alpha=0.1)

# For comparison purposes, the slow version with validate_search = TRUE and
# no subsampled LOO
cvvs <- cv_varsel(
  cvvs_fast,
  validate_search = TRUE
)

plot(cvvs, alpha=0.1, deltas=TRUE, show_cv_proportions=FALSE,
     text_angle=45, size_position = 'primary_x_top') +
  geom_vline(xintercept=seq(0,20,by=5), colour='black', alpha=0.1)

# Subsampled LOO with the difference estimator and simple random sampling
# starts with the simple random sample. For demonstration, we use a
# subsample of nloo=25. nloo should not be too small, but in this case
# the full N is only 100, so we use a quite small nloo for demonstration
# purposes.
nloo <- 25
set.seed(68)
# simple random sample
sub_inds <- sample(1:N, nloo)

# For subsampled LOO with the difference estimator, we need surrogate
# results that are close to the full LOO results. In the case of a search,
# we can expect the fast submodel LOO with validate_search=FALSE to be a
# good surrogate.
#
cvvs_fast_perf <- performances(cvvs_fast)
cvvs_perf <- performances(cvvs)
# For the intercept-only model and the full model, the validate_search=TRUE
# results are the same as with validate_search=FALSE.
elpd_diffe <- numeric()
k <- 0
elpd_diffe[k+1] <- sum(cvvs_fast$summaries$sub[[k+1]]$lppd)
k <- 20
elpd_diffe[k+1] <- sum(cvvs_fast$summaries$sub[[k+1]]$lppd)
elpd_sub <- elpd_diffe
# For the rest we use the difference estimator
for (k in 1:(20-1)) {
  # the proxy is the corresponding validate_search=FALSE result
  elpd_loo_hat <- cvvs_fast$summaries$sub[[k+1]]$lppd
  # For demonstration purposes, instead of running the search with
  # subsampling, we subsample results from the run with
  # validate_search=TRUE and nloo=N
  elpd_loo_sub <- cvvs$summaries$sub[[k+1]]$lppd[sub_inds]
  # The difference estimator
  elpd_diffe[k+1] <- sum(elpd_loo_hat) + N/nloo * sum(elpd_loo_sub - elpd_loo_hat[sub_inds])
  # For comparison, compute the simple subsample average
  elpd_sub[k+1] <- N/nloo * sum(elpd_loo_sub)
}

# Compare 1) validate_search=FALSE elpd, 2) validate_search=TRUE with the
# subsampled difference estimator elpd, and 3) validate_search=TRUE with
# subsampling using the simple subsample average
data.frame(k=0:20,
           fast=cvvs_fast_perf$submodels[,'elpd']-cvvs_perf$submodels[,'elpd'],
           diffe=elpd_diffe-cvvs_perf$submodels[,'elpd'],
           simple=elpd_sub-cvvs_perf$submodels[,'elpd']) |>
  pivot_longer(cols = !k,
               names_to = 'method',
               values_to = 'elpd') |>
  ggplot(aes(x=k, y=elpd, color=method)) +
  geom_line() +
  labs(y='difference to slow elpd')

(Figure: elpd differences of the three methods (fast, diffe, simple) relative to the slow validate_search=TRUE result, as produced by the plot above.)

avehtari avatar Dec 28 '23 16:12 avehtari

The loo package has the diff_srs_est() function, which also estimates the variance (needed for SEs), but the code has some mismatch with the corresponding paper and the estimated variance can be negative, so I'm waiting for https://github.com/stan-dev/loo/issues/237 to be resolved before continuing here.

avehtari avatar Dec 29 '23 15:12 avehtari

Thanks for the clarifications. Now it's clear to me what projpred is supposed to do in case of subsampled LOO.

fweber144 avatar Jan 11 '24 14:01 fweber144

Concerning

loo_subsample_pps returns the weights, but they are not used elsewhere in the code

I think these weights (called wcv in the code) are used in some places:

  • In auc(): https://github.com/stan-dev/projpred/blob/9860c0e609cc94c406e96081640766fd8431523e/R/misc.R#L69
  • In get_stat(): https://github.com/stan-dev/projpred/blob/9860c0e609cc94c406e96081640766fd8431523e/R/summary_funs.R#L286-L287 (calls of get_stat() with non-NULL argument wcv are: https://github.com/stan-dev/projpred/blob/9860c0e609cc94c406e96081640766fd8431523e/R/summary_funs.R#L215-L217 https://github.com/stan-dev/projpred/blob/9860c0e609cc94c406e96081640766fd8431523e/R/summary_funs.R#L241-L246).

I just wanted to add this for the sake of completeness. As you said, simple random sampling (SRS, combined with the difference estimator) doesn't require weights.

fweber144 avatar Jan 24 '24 20:01 fweber144