New functions for better support for different scores and metrics

Open avehtari opened this issue 11 months ago • 15 comments

This is related to issues #223, #220, #213, #201, #135, #106

We have discussed rewriting the CRPS and SCRPS functions to use a better computation (the probability weighted moment form). As new functions are needed, this has led to thinking about an overall improvement of the metric and score functions. Some thoughts:

  • Keep the current loo for elpd, as it does not have arguments for the predictions and observations
  • a) Deprecate loo_predictive_metric and create new loo_predictive_measure, or b) create new loo_predictive_score
    • (a) might be better if we change the output
    • Noa thinks we could drop loo_
    • I think we could drop predictive_ but should not drop loo_
  • For all measures, return a loo object with pointwise and estimates
    • With pointwise information, we can do model comparisons
    • Should we extend the current loo object or make a new object type?
    • One challenge might be the subsampling loo, which increases the amount of work when implementing other measures
  • Extend loo_compare to work with other measures, as we know how to compute diff_se for all current metrics and scores. Currently, loo_predictive_metric returns only the estimate and se for one model's predictions
    • Or do we need a new function
  • Include MAE, RMSE, MSE, R2, ACC, balanced ACC, and Brier score in the metrics
  • Include RPS, SRPS, CRPS, SCRPS, and log score in the scores
  • When computing S?C?RPS or current metrics, maybe store function specific diagnostics?
  • All measures need a psis object or log weights, and thus should we always compute p_loo, too? Or should we compute a measure-specific value describing the amount of fitting?
  • I have a minimal function that computes loo versions of RPS, SRPS, CRPS, SCRPS and includes documentation for the equations with references (but no argument checking, and computes it only for one scalar y, etc)
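As a rough illustration of the probability weighted moment form mentioned above (this is a sketch, not the minimal function referred to in the last bullet; the function name and the classical "smaller is better" orientation are assumptions), the pointwise CRPS for one observation could be computed as

# Sketch: CRPS(F, y) = E|X - y| - E|X - X'| / 2 for a single observation y and
# S predictive draws x. E|X - X'| uses the probability weighted moment
# (order statistics) form, avoiding the S x S matrix of pairwise differences.
crps_pwm_1 <- function(x, y) {
  S <- length(x)
  xs <- sort(x)
  e_abs_y  <- mean(abs(xs - y))                                        # E|X - y|
  e_abs_xx <- 2 * sum((2 * seq_len(S) - S - 1) * xs) / (S * (S - 1))   # E|X - X'|
  e_abs_y - 0.5 * e_abs_xx
}

The SCRPS would reuse the same two expectations with a different combination.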

tagging @n-kall and @jgabry

avehtari avatar Feb 01 '25 18:02 avehtari

I think dropping "loo_" could only make sense if the functions could also calculate the measure for the in-sample case, not just the loo case

n-kall avatar Feb 01 '25 18:02 n-kall

Functions can calculate the in-sample case if the weights are equal, and we want to make it easy to compute the in-sample measures

avehtari avatar Feb 01 '25 18:02 avehtari

Additional notes

Design goals

  1. Easier use of other measures than log score for LOO-CV
    • unify the interface
    • create loo object for all measures
  2. Allow use of other measures than log score for in-sample, test data, and K-fold-CV use
    • unify the interface
    • create a performance object with pointwise results?
  3. All measures available via one common function and as user facing individual functions
  4. Model comparison with other measures than log score
    • use the loo / performance objects with pointwise results
  5. Improved r_eff, log_lik, psis_object handling
    • always start with creating a psis_object which includes r_eff information if necessary information has been provided?
    • allow passing loo object instead of psis object?
  6. Separate leave-one-group-out object for joint scores?
    • we do recommend joint scores
  7. Better diagnostic messages
    • Pareto-k for the normalization term vs. function-specific Pareto-k in E_loo
  8. Unify with bayesplot

Reminder of how the most important functions look now

loo() for computing elpd_loo

loo(x, ..., r_eff = 1, save_psis = FALSE)

where x can be array, matrix, or function for log likelihood.

E_loo() for various expectations (and quantiles)

E_loo(x, psis_object, type = c("mean", "variance", "sd", "quantile"), probs = NULL, log_ratios = NULL)

where x is a numeric vector (S) or matrix (SxN) (any quantity of interest)

loo_predictive_metric() for assessing point predictions

loo_predictive_metric(x, y, log_lik, ..., metric = c("mae", "rmse", "mse", "acc", "balanced_acc"), r_eff = 1)

where x is a numeric matrix (SxN) of predictions and y is a numeric vector (N) of observations

crps(), scrps(), loo_crps(), loo_scrps() for continuous ranked probability score

loo_crps(x, x2, y, log_lik, permutations = 1, r_eff = 1)

where x and x2 are vectors (S) or matrices (SxN) of draws from the predictive distribution, and y is a vector (N) of observations or a single value.

elpd() for K-fold-CV and test data log score

elpd(x, ...)

where x is a log_lik array or matrix from K-fold-CV or test data

loo_compare() for model comparison

loo_compare(x, ...)

where x is an object of class "loo" or a list of such objects

str(loo_object) to remind what is there

List of 10
 $ estimates  : num [1:3, 1:2] -107.097 2.769 214.195 5.653 0.545 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "elpd_loo" "p_loo" "looic"
  .. ..$ : chr [1:2] "Estimate" "SE"
 $ pointwise  : num [1:71, 1:5] -1.08 -3.36 -1.25 -1.2 -1.21 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "elpd_loo" "mcse_elpd_loo" "p_loo" "looic" ...
 $ diagnostics:List of 3
  ..$ pareto_k: num [1:71] 0.149 0.194 0.141 0.143 0.146 ...
  ..$ n_eff   : num [1:71] 2699 1740 3694 3387 3422 ...
  ..$ r_eff   : num [1:71] 0.685 0.664 0.952 0.867 0.876 ...
 $ psis_object: NULL
 $ elpd_loo   : num -107
 $ p_loo      : num 2.77
 $ looic      : num 214
 $ se_elpd_loo: num 5.65
 $ se_p_loo   : num 0.545
 $ se_looic   : num 11.3
 - attr(*, "dims")= int [1:2] 4000 71
 - attr(*, "class")= chr [1:3] "psis_loo" "importance_sampling_loo" "loo"
 - attr(*, "yhash")= chr "eca8bfd2af038c52c6a126a518346f86cabe2f1b"
 - attr(*, "model_name")= chr "fit_lin"

We could always start with calling loo()

my_loo <- loo(log_lik, ..., r_eff = 1, save_psis = TRUE)

and then always pass psis_object or loo_object for other functions.

predictive_measure() would allow different scores and metrics except log score

pred_meas1 <- predictive_measure(y, ypred, psis_object, measure=c("RPS", "SRPS", "CRPS", "SCRPS", "mae", "rmse", "mse", "R2", "acc", "balanced_acc"))

If psis_object is NULL then non-PSIS weighting is done, and it is assumed that ypred is in-sample, test-sample, or K-fold-CV based. r_eff is part of psis_object. The order of the first two arguments is the same as in bayesplot::pp_check(), except we use ypred instead of yrep because it isn't always draws from the posterior predictive distribution. It can be draws from the predictive distribution (scores), draws from the posterior of the mean of the predictive distribution, or, if psis_object is NULL, point estimates.
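As a minimal sketch of the weighting logic described above (the helper name, the S x N layout of ypred matching the current functions, and the fallback behaviour are assumptions, not the final interface):

# Sketch: pointwise absolute errors, PSIS-weighted when a psis_object is given,
# equal-weighted otherwise (in-sample, test data, K-fold-CV); ypred is S x N.
# Assumes the loo package is loaded so that weights() dispatches on psis objects.
pointwise_abs_error <- function(y, ypred, psis_object = NULL) {
  if (is.null(psis_object)) {
    yhat <- colMeans(ypred)           # equal weights
  } else {
    w <- exp(weights(psis_object))    # S x N normalized PSIS weights
    yhat <- colSums(w * ypred)        # E_loo-style weighted mean per observation
  }
  abs(y - yhat)                       # pointwise values to store in $pointwise
}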

We could also allow adding things to an existing loo object

my_loo <- add_measure(my_loo, y, ypred, measure=c("RPS", "SRPS", "CRPS", "SCRPS", "mae", "rmse", "mse", "R2", "acc", "balanced_acc"))

Additional measure results would be added as additional rows in $estimates, additional columns in $pointwise, and additional list items in $diagnostics

With several measures, the printed output could look like this

Computed from 4000 posterior draws and 71 observations

         Estimate   SE
elpd_loo   -107.1  5.7
p_loo         2.8  0.5
SCRPS_loo   -23.7  1.2
R2_loo       0.25 0.02
------

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

When comparing models, show the ordering based on only one measure

predictive_compare(pred_meas1, pred_meas2, measure="R2")

If all arguments have the same measure stored, print the difference and diff_se (plus new diagnostics). The measure argument allows selecting which measure to use.

Computed from 4000 draws and 71 observations

Model      R2_diff se_diff
fit_lin    0.0       0.0   
fit_lin_t -0.01      0.001

For some measures, like RMSE and R2, there is no pointwise RMSE or R2; instead we store pointwise squared residuals and squared differences between the observations and their mean, and when doing the pointwise comparison we compute the uncertainty for the desired measure difference. We have these difference computations in projpred (and should also write a short document with all the equations). The diagnostics could include function-specific k-hats when E_loo is used.
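To illustrate the idea (a delta-method style sketch based on the stored pointwise squared errors; this illustrates the general approach, not necessarily the projpred equations), the RMSE difference between two models and its SE could be computed as

# Sketch: e1sq and e2sq are the paired pointwise squared errors (length N)
# stored for two models; the SE of sqrt(MSE1) - sqrt(MSE2) comes from the
# delta method applied to the covariance of the two MSE estimates.
rmse_diff_se <- function(e1sq, e2sq) {
  N  <- length(e1sq)
  m1 <- mean(e1sq)                                   # MSE of model 1
  m2 <- mean(e2sq)                                   # MSE of model 2
  V  <- cov(cbind(e1sq, e2sq)) / N                   # covariance of the MSE estimates
  g  <- c(1 / (2 * sqrt(m1)), -1 / (2 * sqrt(m2)))   # gradient of sqrt(m1) - sqrt(m2)
  c(rmse_diff = sqrt(m1) - sqrt(m2),
    se = sqrt(drop(t(g) %*% V %*% g)))
}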

avehtari avatar Feb 09 '25 15:02 avehtari

@avehtari I think you said in the Stan meeting today that you had made some new comments on this, but I don't see additional comments. Or did I misunderstand what you said in the meeting? No rush, just wanted to make sure I didn't miss anything.

jgabry avatar Feb 13 '25 21:02 jgabry

More additional notes.

  • We want to support PSIS-LOO, brute force LOO, K-fold-CV, independent test data, in-sample performance
    • Use the same structure as the current "loo" class, but make small variants so that the diagnostics and printing make sense
    • we could call this "performance" or "predperf" object
  • Above there was a proposal (a) to have common _measure functions for metrics and scores, and (b) to allow computation of several measures at once. This doesn't work, as scores need draws from the predictive distribution (ypred) and metrics need draws from the posterior or point estimates (mupred). Thus it's better to separate them.
  • We could allow a score or metric to be a function, at least for scores and metrics where the total score or metric is a sum or mean of pointwise scores/metrics

When using PSIS-LOO we could always start with calling loo()

my_loo <- loo(log_lik, ..., r_eff = 1, save_psis = TRUE)

save_psis makes the loo object bigger, but not by more than the size of log_lik. rstanarm and brms could still continue to not save the psis object by default, at the cost of additional computation time for other measures

predictive_score() would allow different scores except log score

my_loo2 <- predictive_score(y, ypred, predperf_object=my_loo, score=c("RPS", "SRPS", "CRPS", "SCRPS"))

where ypred are draws from the predictive distribution. If the predperf_object argument is given a loo object or psis_object, PSIS-LOO computation is used, and the returned object is a loo object.

predictive_metric() would allow different metrics

my_loo2 <- predictive_metric(y, mupred, predperf_object=my_loo, metric=c("mae", "rmse", "mse", "R2", "acc", "balanced_acc"))

where mupred are draws from the posterior distribution. If the predperf_object argument is given a loo object or psis_object, PSIS-LOO computation is used, and the returned object is a loo object. We could also keep adding things to the same object (here my_loo2 has the log score and *RPS, and we add R^2)

my_loo3 <- predictive_metric(y, mupred, predperf_object=my_loo2, metric=c("R2"))

We could also not give predperf_object, or set it to NULL,

my_predperf1 <- predictive_score(y, ypred, score=...)
my_predperf1 <- predictive_score(y, ypred, predperf_object=NULL, score=...)
my_predperf2 <- predictive_metric(y, mupred, metric=...)
my_predperf2 <- predictive_metric(y, mupred, predperf_object=NULL, metric=...)

in which case the non-weighted score/metric is computed (for K-fold-CV, independent test data, or in-sample performance) and the returned object is a predperf object (which doesn't contain PSIS-LOO).

We could also add to a non-PSIS-LOO predperf object, in which case the non-weighted computation is used.

my_predperf3 <- predictive_metric(y, mupred, predperf_object=my_predperf1, metric=...)

If the score or metric is a function, it would be similar to the currently supported llfun approach, but it needs to have arguments for y and ypred or mupred. The function would also need to be able to provide the name of the score/metric. It would fill in the pointwise array, and then the generic loo functionality would create the corresponding estimate and se.
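For example, a user-supplied pointwise metric under this idea might look like the following sketch; the argument names, how the weights are passed, and how the name is communicated are all assumptions / open questions.

# Hypothetical user-supplied pointwise metric: one observation y_i, the draws
# mupred_i for it, and optional normalized weights w_i; returns one pointwise value.
my_metric <- function(y_i, mupred_i, w_i = NULL) {
  if (is.null(w_i)) w_i <- rep(1 / length(mupred_i), length(mupred_i))
  yhat_i <- sum(w_i * mupred_i)   # weighted point prediction
  (y_i - yhat_i)^2                # pointwise squared error
}
attr(my_metric, "measure_name") <- "my_mse"   # one possible way to provide the name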

avehtari avatar Feb 14 '25 08:02 avehtari

@jgabry I had written the comment, clicked preview, got distracted, and thought I had also clicked "Comment". The text was already written, and I have now clicked "Comment"

avehtari avatar Feb 14 '25 08:02 avehtari

Good point about scores and metrics requiring different inputs. I hadn't thought enough about that. What you now wrote looks good, ~~maybe we would still want helper functions for adding metrics / scores to existing predperf objects to help with piping, but that could also be a later addition as it could just call the same function with reordered arguments~~ edit: it was pointed out that _ can be used to specify which argument to pipe to

n-kall avatar Feb 15 '25 07:02 n-kall

After discussion with @n-kall, here's yet another revision of the design

  • It would be good to allow also measure="logscore", but that requires log-densities, and instead of creating more functions, maybe it's better to have more named arguments.
  • It would be good to make it more clear when PSIS is used for computation, which means it might be better to have separate functions for non-PSIS and PSIS computation.

pred_measure() would be used for in-sample, independent test data, and K-fold-CV

pred_measure(y = NULL,
                   ypred = NULL,
                   mupred = NULL,
                   ylp = NULL,
                   predperf = NULL,
                   measure = c("logscore", "elpd", "r2", "rps", "crps", "scrps", "mae", "rmse", "mse", "acc", "energy"),
                   group_ids = NULL
)

where, depending on which measures are to be computed, some combination of y, ypred, mupred, and ylp needs to be given, and those not provided get the default value NULL. ylp starts with y, similar to ypred, and lp refers to the log predictive density/probability (plain lp might get confused with lp__, and log_lik is a bit misleading when the target is a predictive measure). The predperf object would be similar to the current loo object, but they could not be mixed, and here the predperf argument can't be a loo object. As any needed combination of y, ypred, mupred, and ylp can be provided, we can compute many measures at once and just check that for scores we have y and ypred and for metrics we have y and mupred. group_ids would be useful when using the joint logscore or energy score (multivariate CRPS); it could be a vector (one group_id per obs) or a list (multiple group_ids per obs)

loo_pred_measure() would be used for PSIS-LOO computed estimates

loo_pred_measure(y = NULL,
                       ypred = NULL,
                       mupred = NULL,
                       ylp = NULL,
                       loo = NULL,
                       measure = c("logscore", "elpd", "r2", "rps", "crps", "scrps", "mae", "rmse", "mse", "acc", "energy"),
                       group_ids = NULL,
                       psis_object = NULL,
                       save_psis = NULL
)

The arguments are the same, except that instead of a predperf object, a loo or psis object can be given. If neither of these is given, but ylp is given, then ylp works as log_lik and the psis object is created internally. save_psis would control whether the psis_object is also stored in the returned object.

In both functions, ypred, mupred, and ylp could also be functions.
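Hypothetical usage under these proposed signatures (the data objects y, ypred, mupred, ylp, and psis_object are assumed to already exist):

# In-sample / test data / K-fold-CV: unweighted computation, returns a predperf object
perf <- pred_measure(y = y, ypred = ypred, mupred = mupred, ylp = ylp,
                     measure = c("crps", "rmse", "logscore"))

# PSIS-LOO: weighting via an existing psis object (or loo object), returns a loo object
loo_perf <- loo_pred_measure(y = y, ypred = ypred, mupred = mupred,
                             measure = c("scrps", "r2"),
                             psis_object = psis_object)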

@jgabry , what do you think?

avehtari avatar Feb 17 '25 19:02 avehtari

I think it looks pretty good. A few thoughts/questions:

  • Do we think it would be more confusing for the user to have different functions for scores and metrics, or the same function but they have to figure out which of the many arguments they need to specify?

  • In one of the comments above you say

    We could also keep adding things to the same object (here my_loo2 has log score and *RPS, and we add R^2) my_loo3 <- predictive_metric(y, mupred, predperf_object=my_loo2, metric=c("R2"))

    This looks like it's creating a new object (my_loo3) instead of adding to the existing object (my_loo2). Is that intentional? That is, is the idea that we would create a new object that contains everything in the original object plus the new measure, or is the idea that we would return the original object but now also containing the new measure information?

jgabry avatar Feb 18 '25 17:02 jgabry

Regarding point 1: one idea behind the single function (from my side anyway) was that a user could compute several measures with just one call (by providing y, ypred, mupred, and ylp) without worrying too much about which measure is a score and which is a metric. But you bring up a good point that if a user just wants to use one specific measure (or a couple of the same type), then they need to know which of the arguments to specify. We would then likely need clear messages guiding users

n-kall avatar Feb 18 '25 17:02 n-kall

Good point about computing several with just one call. And yeah, we'll definitely need good documentation and good informative error messages to help the users figure out which arguments they need.

jgabry avatar Feb 18 '25 18:02 jgabry

Above, the latest suggestion was to have the functions loo_pred_measure() and pred_measure(). For backward compatibility, loo_pred_measure() would still return an object like

List of 10
 $ estimates  : num [1:3, 1:2] -107.097 2.769 214.195 5.653 0.545 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "elpd_loo" "p_loo" "looic"
  .. ..$ : chr [1:2] "Estimate" "SE"
 $ pointwise  : num [1:71, 1:5] -1.08 -3.36 -1.25 -1.2 -1.21 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "elpd_loo" "mcse_elpd_loo" "p_loo" "looic" ...
 $ diagnostics:List of 3
  ..$ pareto_k: num [1:71] 0.149 0.194 0.141 0.143 0.146 ...
  ..$ n_eff   : num [1:71] 2699 1740 3694 3387 3422 ...
  ..$ r_eff   : num [1:71] 0.685 0.664 0.952 0.867 0.876 ...
 $ psis_object: NULL
 $ elpd_loo   : num -107
 $ p_loo      : num 2.77
 $ looic      : num 214
 $ se_elpd_loo: num 5.65
 $ se_p_loo   : num 0.545
 $ se_looic   : num 11.3
 - attr(*, "dims")= int [1:2] 4000 71
 - attr(*, "class")= chr [1:3] "psis_loo" "importance_sampling_loo" "loo"
 - attr(*, "yhash")= chr "eca8bfd2af038c52c6a126a518346f86cabe2f1b"
 - attr(*, "model_name")= chr "fit_lin"

The difference would be that $estimates could have additional rows and $pointwise additional columns corresponding to the other measures. In the case of MSE, RMSE, and R^2, $pointwise would store the pointwise squared errors, from which the different estimates can be derived. $diagnostics could include h-specific Pareto-k-hats. loo_pred_measure() has a new argument group_ids, which might be useful to store in the returned object, too.
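A sketch of how an additional measure could be merged into such an object (the helper name is hypothetical; the structure follows the str() output above):

# Add one measure: a new column in $pointwise and a new row in $estimates.
add_measure_results <- function(object, name, pointwise, estimate, se) {
  object$pointwise <- cbind(object$pointwise,
                            matrix(pointwise, ncol = 1, dimnames = list(NULL, name)))
  object$estimates <- rbind(object$estimates,
                            matrix(c(estimate, se), nrow = 1,
                                   dimnames = list(name, c("Estimate", "SE"))))
  object
}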

Currently, the elpd() function returns an object like

List of 2
 $ estimates: num [1:2, 1:2] -80.26 160.52 3.24 6.47
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "elpd" "ic"
  .. ..$ : chr [1:2] "Estimate" "SE"
 $ pointwise: num [1:32, 1:2] -2.34 -2.12 -2.3 -2.16 -2.07 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "elpd" "ic"
 - attr(*, "dims")= int [1:2] 1000 32
 - attr(*, "class")= chr [1:2] "elpd_generic" "loo"

which has the same $estimates and $pointwise as the loo object. This one has class elpd_generic. The new pred_measure() function can use other measures, so maybe the additional class should be predperf? I guess we would keep the class loo as referring to the loo package. pred_measure() would not know, based on just the arguments, whether the input is for in-sample data, test data, or brute-force cross-validation, but maybe that information could be included with an additional argument, e.g. type="test" or type="kfold"? Maybe that type argument should even be obligatory, so that no one would use loo_compare() to compare in-sample and test data performance?

avehtari avatar Mar 24 '25 13:03 avehtari

Currently, loo_compare() is used to compare elpd's. The output is like

 'compare.loo' num [1:2, 1:8] 0 0 0 0 -3775 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "fitt5" "fitt5"
  ..$ : chr [1:8] "elpd_diff" "se_diff" "elpd_loo" "se_elpd_loo" ...

We can add other measures as additional columns. The object could also store the information on whether it was created by comparing objects created with loo_pred_measure() or with pred_measure() (and, in the latter case, the type).

The equations for computing SEs for differences of various measures are currently in an Overleaf document (and implemented in projpred).

It would be useful to also store some of the diagnostic information from the individual loo/predperf objects, so that when printing the comparison it would be possible to provide this diagnostic information, too. Currently the output looks like

       elpd_diff se_diff
model3   0.0       0.0  
model2 -32.0       0.0  
model1 -64.0       0.0  

We could add a third column, note, which would list diagnostic footnote numbers like "1,3", and then the footnotes would be listed below, e.g.

1. There were many high Pareto-k values when computing PSIS-LOO estimate.
3. There are some observations that might be outliers.

avehtari avatar Mar 24 '25 13:03 avehtari

Based on the discussion in the last meeting, here are some additional clarifications

In https://github.com/stan-dev/loo/issues/281#issuecomment-2663964062 I suggested the following function signatures

pred_measure(y = NULL,
                   ypred = NULL,
                   mupred = NULL,
                   ylp = NULL,
                   predperf = NULL,
                   measure = c("logscore", "elpd", "r2", "rps", "crps", "scrps", "mae", "rmse", "mse", "acc", "energy"),
                   group_ids = NULL
)

loo_pred_measure(y = NULL,
                       ypred = NULL,
                       mupred = NULL,
                       ylp = NULL,
                       loo = NULL,
                       measure = c("logscore", "elpd", "r2", "rps", "crps", "scrps", "mae", "rmse", "mse", "acc", "energy"),
                       group_ids = NULL,
                       psis_object = NULL,
                       save_psis = NULL
)

In the following N is the number of observations and S is the number of posterior draws.

These are quite different from existing signatures:

  • loo(x, ...) assumes x is an array or matrix of log likelihood values, or a function which can compute log_lik given additional data argument. No y needs to be provided (except in data for function approach).
  • crps(x, x2, y, ...) assumes x and x2 are matrices or arrays of draws from the predictive distribution, which are called ypred in the new proposal. We don't need x2 with the better crps computation. y is the data, as in the new proposal.
  • loo_predictive_metric(x, y, log_lik, ...) assumes x is a matrix of predictions, and in most cases we would assume these would be denoted as mupred in the new proposal.
  • E_loo(x, psis_object, log_ratios) assumes x is a numeric vector or matrix, with the number of draws matching the psis_object or log_lik.

In the new proposal

  • y is assumed to be an array or vector. That is, in the beginning we support only a single output, and, for example, for a categorical model y is of factor type.
  • ypred is assumed to be draws from the predictive distribution as an array or matrix of size N x S. In the future we can allow it to be a function that can produce predictions, and then an optional data argument needs to be given.
  • mupred is assumed to be draws from the location parameter of the predictive distribution as an array or matrix of size N x S. In the case of pred_measure(), mupred can also be readily computed pointwise predictions as a vector of length N, or an array or matrix of size N x 1 (we assume that loo_pred_measure() is always doing PSIS and thus readily computed pointwise predictions are not allowed).
  • ylp is the log predictive density / probability values as an array or matrix of size N x S. In the future we can allow it to be a function that can produce the values (as the current loo() supports), and then an optional data argument needs to be given. In the case of pred_measure(), we could also allow ylp to be a readily computed vector of length N, or an array or matrix of size N x 1.

The steps of the computation inside pred_measure() and loo_pred_measure()

  1. compute pointwise measures (and store in predperf or loo object)
  2. compute summaries: estimate and SE (and store in predperf or loo object)
  • Currently loo() does the computation using matrices, except when x is a function, in which case the computation is done using *apply (possibly in parallel)
  • Currently crps() does the computation using matrices
  • loo_predictive_metric() calls E_loo() and after that the metrics get y and yhat vectors
  • E_loo() does the computation using vapply

So in computation step 1 we need to choose whether to do the computation using matrices or vapply. I think vapply would be a more modular approach, so that each additional measure is simpler to implement. This might also make multivariate measures for groups of observations easier, as each multivariate measure only needs to know about its own specifics. The vapply approach would also be natural when extending ypred, mupred, and ylp to be functions. Computation using matrices is likely to have lower overhead, and we might include a special matrix-based computation for the most used measure (log score / elpd). @jgabry and @n-kall, what do you think?
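A sketch of the vapply-based option for step 1 (the helper name is hypothetical; the N x S layout follows the description above): loop over the observations and call a measure-specific pointwise function.

# Sketch: pointwise computation with vapply; ypred and w are N x S matrices of
# draws and normalized weights, pointwise_fun computes one measure value per
# observation from its draws and weights.
compute_pointwise <- function(y, ypred, w, pointwise_fun) {
  vapply(seq_along(y),
         function(i) pointwise_fun(y[i], ypred[i, ], w[i, ]),
         numeric(1))
}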

Note that for all MSE, RMSE, and R^2 we would store pointwise MSE, and thus they can all call the same pointwise function. If pointwise MSE has already been computed, it does not need to be recomputed unless we assume the user has changed y, ypred, or mupred.

Only after we have pointwise measure values stored in predperf or loo object, we compute summaries. Estimates and how to compute SE depend on the measure. To compute RMSE and R^2 summaries, we use the pointwise MSEs.

ping also @VisruthSK

avehtari avatar Aug 28 '25 08:08 avehtari

I agree with Aki's suggestions; making the code more modular at this stage (using vapply) would, I think, be cleaner. We could then later transition to matrices for some calculations

n-kall avatar Aug 29 '25 13:08 n-kall