
Implement sample size planning functions

bsiepe opened this issue 1 year ago • 2 comments

Hello Phil,

As we briefly discussed via email, it could be useful to have functionality for sample size planning in simulation studies, so that a desired Monte Carlo standard error (MCSE) can be achieved. These functions would let users specify their performance measure of interest and the desired precision, and would return the number of replications needed to achieve that precision. The calculations can be based on the formulas we provide in Siepe et al. (2023).

We (Samuel Pawel, František Bartoš, and I) would like to contribute to this functionality.

Our Suggestions:

  • implement helper functions plan_*, where * stands for a performance measure such as bias or coverage, as implemented in the SimDesign summary functions
  • let users either specify a 'worst-case' scenario (in case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation)
  • the plan_*() functions can be used within the Summarise() function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.
  • this would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.

Sketch of what such a function could look like:

For a performance measure with known SE:

# Replications needed so that the MCSE of the empirical detection rate
# (a proportion, with known SE sqrt(p * (1 - p) / n_rep)) hits the target;
# target_edr = 0.5 is the worst case.
plan_EDR <- function(target_mcse,
                     target_edr = 0.5){
  n_rep <- target_edr * (1 - target_edr) / target_mcse^2
  n_rep
}

For a performance measure with unknown SE:

# Replications needed so that the MCSE of the bias, sqrt(var / n_rep),
# hits the target, given an empirical variance of the estimates.
plan_bias <- function(target_mcse,
                      empirical_var){
  n_rep <- empirical_var / target_mcse^2
  n_rep
}
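
A quick usage sketch for both cases (the targets and the empirical variance below are made-up illustrative values; function definitions repeated from above):

```r
plan_EDR <- function(target_mcse, target_edr = 0.5)
  target_edr * (1 - target_edr) / target_mcse^2

plan_bias <- function(target_mcse, empirical_var)
  empirical_var / target_mcse^2

# Known SE: worst case for a proportion-type measure (EDR = 0.5)
plan_EDR(target_mcse = 0.01)    # 0.5 * 0.5 / 0.01^2 = 2500

# Unknown SE: empirical variance from a pilot simulation
plan_bias(target_mcse = 0.02, empirical_var = 1.5)    # 1.5 / 0.02^2 = 3750
```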

Depending on your input, we will open a pull request suggesting the functions soon.

Best,
Björn

bsiepe avatar Feb 07 '24 09:02 bsiepe

Hi Björn,

This sounds quite reasonable to me.

  • implement helper functions plan_*, where * stands for performance measures such as bias or coverage as implemented in the SimDesign summary functions

Agreed, this is a nice convention to use in the package, and the respective functions (bias() and plan_bias()) could be linked to in the package documentation.

  • let users either specify a 'worst-case' scenario (in case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation)

I like this idea, but I worry about using empirical estimates for simulation planning. Since the empirical variance estimates are themselves a function of the replication size, one could easily over- or underestimate the requisite number of replications to obtain the desired precision, particularly if the pilot replication size was too low. Ideally, some type of confidence interval should be included for this situation: either the complete vector of observations used to obtain the empirical variance estimates is passed to the function (so that internal uncertainty quantifiers can be applied, whether from the large-sample normal family or via bootstrapping), or the user supplies a standard error indicating the degree of precision in the empirical variances. I'd be fine with either.
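
One hedged sketch of what passing the raw observations could look like: a hypothetical plan_bias_ci() that wraps a chi-squared confidence interval around the empirical variance (assuming the pilot estimates are roughly normal, so that (n - 1) * s^2 / sigma^2 is chi-squared distributed) and propagates it into the required replication count. The function name and interface are illustrative only:

```r
# Hypothetical sketch: propagate uncertainty in the pilot's empirical
# variance into the planned number of replications. Assumes approximate
# normality of the estimates so the chi-squared interval for a variance
# applies; for non-normal estimates a bootstrap interval could be
# substituted.
plan_bias_ci <- function(target_mcse, estimates, level = 0.95) {
  n <- length(estimates)
  s2 <- var(estimates)
  alpha <- 1 - level
  # chi-squared confidence interval for the true variance
  var_lo <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)
  var_hi <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)
  # required replications: n_rep = var / MCSE^2
  c(lower = var_lo / target_mcse^2,
    point = s2 / target_mcse^2,
    upper = var_hi / target_mcse^2)
}
```

With a small pilot run, the lower/upper bounds on n_rep will be wide, which makes the imprecision of the plan visible to the user.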

  • the plan_* can be used within the Summarise() function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.

It's unclear to me why this would be necessary within a Summarise() call. Since SimDesign stores the results information, one could simply extract the analysis results and pass them to plan_* in raw form (see above), or reduce them manually. Basically, if this were to be constructed with Summarise() support, then utilizing the raw results data would be the most sensible path.
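
The raw-results workflow described above might look roughly like this (a sketch: `res` stands in for replication-level estimates extracted from a pilot run, and the column names are purely illustrative):

```r
# Sketch of the raw-results workflow: pass replication-level estimates
# from a pilot run directly to plan_bias(), one column per method.
plan_bias <- function(target_mcse, empirical_var)
  empirical_var / target_mcse^2

# stand-in for estimates extracted from a pilot simulation
set.seed(1)
res <- data.frame(methodA = rnorm(100, sd = 1),
                  methodB = rnorm(100, sd = 2))

# required replications per method for a target MCSE of 0.02 on the bias
sapply(res, function(est)
  plan_bias(target_mcse = 0.02, empirical_var = var(est)))
```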

  • this would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.

A vignette would be great! Though let me know your thoughts about my above points before proceeding. Thanks!

philchalmers avatar Feb 10 '24 02:02 philchalmers

Hello Phil,

Thank you for your thoughts and willingness to include the idea in your package.

  1. Yes, linking the functions via the documentation is a good idea.
  2. That is a good point. We will include both a point estimate for the sample size and a lower/upper bound based on an uncertainty estimate for the empirical variance. The latter can be obtained in the two ways you mentioned. This way, users will notice that the implied sample size may be imprecise if they use only a few replications for their pilot run.
  3. Sorry if we were unclear. Indeed, the plan_*() functions could both be used within and outside of a Summarise() call if users pass the raw results data. We just thought that using the functions within a Summarise() call might be a useful workflow for users incrementally building their simulation study, but it is not necessary.
  4. Great, we suggest first creating a pull request for the functionality itself and then creating a vignette later on when you have reviewed and possibly accepted our suggestions.

We will prepare a pull request incorporating this functionality. Please bear with us, as this may take some time.

bsiepe avatar Feb 21 '24 13:02 bsiepe