cmdstanr Add a function for generating initial values from a CmdStanMCMC object, csv files or a draws object

Submission Checklist

[X ] Run unit tests
[ X] Declare copyright holder and agree to license (see below)

Summary

I initially built these functions for another package I'm working on last week. But yesterday I saw #876, so I decided to share them. I did see @SteveBronder is already working on it, but since I had them ready, here's a PR for them - if you decide to go with the other approach, no worries!

This PR adds a generic method generate_inits which creates a lists of lists that can be used to pass to the init argument for sampling.

There are three specific methods for generating inits from CmdStanMCMC object, a vector of csv files with draws output, or a draws object.

The methods each takes as arguments:

variables: specify which variables to generate inits for
FUN: a function that takes a vector of draws for a parameter and outputs a single value. The default is to sample 1 random value from the draws, but it can be any function such as mean, median, etc
The method for CmdStanMCMC also takes an argument draws with options "last", "sampling", "all". If the argument is "last", it uses a new function I wrote read_last_draws which retrieves the last draw recorded from the csv files. This does not use the builting read_cmdstan_csv function, but a similar method. This can be useful for very large fits - it takes 0-2 seconds regardless of the size of the files

The general logic is that for each chain, it:

retrieves the draws
applies the function specified by FUN to each parameter
reformats the result into a list of lists by infering the dimensions for each parameter

All the methods are documented, and include unit tests.

Examples

inits from a CmdStanMCMC object

stanfit <- cmdstanr::cmdstanr_example("logistic")
inits1 <- generate_inits(stanfit)
inits2 <- generate_inits(stanfit, FUN = mean)
inits3 <- generate_inits(stanfit, FUN = quantile, probs = 0.5)
inits4 <- generate_inits(stanfit, draws = "last")
str(inits1)

#> List of 4
#>  $ :List of 2
#>   ..$ alpha: num 0.469
#>   ..$ beta : num [1:3(1d)] -0.776 -0.423 0.393
#>  $ :List of 2
#>   ..$ alpha: num 0.22
#>   ..$ beta : num [1:3(1d)] -0.334 -0.162 0.475
#>  $ :List of 2
#>   ..$ alpha: num 0.153
#>   ..$ beta : num [1:3(1d)] -0.692 -0.331 0.463
#>  $ :List of 2
#>   ..$ alpha: num 0.396
#>   ..$ beta : num [1:3(1d)] -0.7717 -0.0398 0.3815

inits from a vector of file paths

files <- stanfit$output_files()
generate_inits(files)

inits from a draws object

draws <- stanfit$draws()
generate_inits(draws)

warmup and then use the final draws for the inits of a separate sampling stage

warmup <- cmdstanr_example("logistic", parallel_chains = 4, iter_sampling = 0, save_warmup = T)
inits <- generate_inits(warmup, draws = "last")
mod <- cmdstan_model(exe_file = warmup$runset$exe_file())
fit <- mod$sample(warmup$data_file(),
                  parallel_chains = 4,
                  init = inits,
                  iter_warmup = 0,
                  inv_metric = warmup$inv_metric(matrix = FALSE),
                  step_size = warmup$metadata()$step_size_adaptation,
                  adapt_engaged = FALSE)

   variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__       -66.06 -65.69 1.56 1.33 -69.06 -64.27 1.00     2020     2383
 alpha        0.38   0.38 0.22 0.22   0.02   0.76 1.00     4141     2964
 beta[1]     -0.67  -0.66 0.25 0.25  -1.10  -0.26 1.00     4307     3175
 beta[2]     -0.27  -0.27 0.23 0.23  -0.66   0.10 1.00     3956     3006
 beta[3]      0.68   0.67 0.28 0.27   0.23   1.14 1.00     4183     2273
 log_lik[1]  -0.51  -0.51 0.10 0.10  -0.69  -0.36 1.00     4143     2797
 log_lik[2]  -0.40  -0.38 0.15 0.14  -0.68  -0.19 1.00     4318     2762
 log_lik[3]  -0.50  -0.46 0.22 0.21  -0.92  -0.21 1.00     4368     3021
 log_lik[4]  -0.45  -0.43 0.16 0.15  -0.74  -0.23 1.00     3889     3043
 log_lik[5]  -1.19  -1.17 0.29 0.28  -1.68  -0.76 1.00     4594     2657

# compare with standard fitting with combined warmup and sampling
fit_standard <- mod$sample(warmup$data_file(),
                           parallel_chains = 4)

   variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__       -66.01 -65.69 1.48 1.29 -68.81 -64.28 1.00     2186     3008
 alpha        0.38   0.38 0.22 0.22   0.02   0.75 1.00     4170     2867
 beta[1]     -0.67  -0.67 0.25 0.24  -1.09  -0.27 1.00     3536     3062
 beta[2]     -0.27  -0.27 0.23 0.22  -0.65   0.11 1.00     4264     3301
 beta[3]      0.69   0.69 0.27 0.28   0.25   1.15 1.00     4175     3179
 log_lik[1]  -0.51  -0.51 0.10 0.10  -0.69  -0.36 1.00     4055     3100
 log_lik[2]  -0.40  -0.38 0.15 0.14  -0.67  -0.20 1.00     4626     3386
 log_lik[3]  -0.49  -0.46 0.22 0.20  -0.90  -0.20 1.00     4306     3164
 log_lik[4]  -0.45  -0.43 0.16 0.15  -0.73  -0.23 1.00     3885     2900
 log_lik[5]  -1.20  -1.18 0.29 0.28  -1.71  -0.76 1.00     4678     3246

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Vencislav Popov

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

Mar 14 '24 06:03 venpopov

This doesn't have a method for pathfinder, but since generate_inits is a generic, a method could be added if you like this approach for the general framework

Mar 14 '24 07:03 venpopov

Thank you very much @venpopov! @SteveBronder min reviewing this, given that you are already working on something similar?

Mar 14 '24 09:03 rok-cesnovar

Codecov Report

Attention: Patch coverage is 78.75000% with 17 lines in your changes are missing coverage. Please review.

Project coverage is 86.56%. Comparing base (28329d7) to head (14ec040).

:exclamation: Current head 14ec040 differs from pull request most recent head d729b32. Consider uploading reports for the commit d729b32 to get more accurate results

Files	Patch %	Lines
R/inits.R	78.75%	17 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #930      +/-   ##
==========================================
- Coverage   88.32%   86.56%   -1.77%     
==========================================
  Files          12       13       +1     
  Lines        4549     4629      +80     
==========================================
- Hits         4018     4007      -11     
- Misses        531      622      +91

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Mar 14 '24 10:03 codecov-commenter

Very cool thank you for the contribution!

The issue with pathfinder is that pathfinder is the only Stan algorithm exposed in cmdstanr that uses cmdstan's internal threading vs. dispatching multiple cmdstan processes. So you need to generate files ending in _{1...N}.json for each pathfinder respectively. It gets weird because every other algorithm uses num_procs (the number of cmdstan processes to create) while pathfinder only needs 1 proc but with multiple init files. A lot of if statements can fix this but it starts looking really patchy.

The PR I wrote is a little more heavy handed. It rips out all of the dispatching and uses cmdstan's internal threading for all the algorithms.

But I also don't have code to work over raw draws and I like your init functions much more than mine. Would you mind if I cherry picked some of this over to the branch I'm writing and added you there as a contributor?

Mar 14 '24 14:03 SteveBronder

No, I don't mind - go ahead!

Mar 14 '24 14:03 venpopov

Closing this as implemented in #937

Apr 19 '24 04:04 andrjohns