Implement sample size planning functions
Hello Phil,
As we briefly discussed via email, it could be useful to have functionality that allows for sample size planning in simulation studies, i.e., determining the number of replications needed to achieve a desired Monte Carlo Standard Error (MCSE). These functions should allow users to specify their performance measure of interest and the desired precision, and then return the number of replications needed to achieve said precision. The calculations can be based on the formulas we provide in Siepe et al. (2023).
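To make the general idea concrete: for a performance measure that is a proportion $p$ (e.g., coverage or the empirical detection rate), the MCSE is known in closed form, so the required number of replications follows directly by inverting it:

$$
\mathrm{MCSE} = \sqrt{\frac{p(1-p)}{n_{\mathrm{rep}}}} \quad\Longrightarrow\quad n_{\mathrm{rep}} = \frac{p(1-p)}{\mathrm{MCSE}^2}
$$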
We (Samuel Pawel, František Bartoš, and I) would like to contribute to this functionality.
Our Suggestions:
- Implement helper functions `plan_*`, where `*` stands for performance measures such as bias or coverage, as implemented in the SimDesign summary functions.
- Let users either specify a 'worst-case' scenario (in the case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation).
- The `plan_*` functions can be used within the `Summarise()` function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.
- This would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.
Sketch of what such a function could look like:
For a performance measure with known SE:
```r
plan_EDR <- function(target_mcse,
                     target_edr = 0.5) {
  # The EDR is a proportion, so its MCSE is known in closed form:
  # sqrt(edr * (1 - edr) / n_rep). The default target_edr = 0.5 is the
  # worst case (largest variance). Solve for n_rep and round up.
  ceiling(target_edr * (1 - target_edr) / target_mcse^2)
}
```
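For example, targeting an MCSE of 0.01 under the worst-case EDR of 0.5:

```r
plan_EDR(target_mcse = 0.01)
#> [1] 2500
```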
For a performance measure with unknown SE:
```r
plan_bias <- function(target_mcse,
                      empirical_var) {
  # The MCSE of the estimated bias is sqrt(var / n_rep), where the variance
  # must be estimated empirically (e.g., from a pilot run). Solve for n_rep
  # and round up.
  ceiling(empirical_var / target_mcse^2)
}
```
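Analogously, with an empirical variance of 0.04 from a pilot run and a target MCSE of 0.005:

```r
plan_bias(target_mcse = 0.005, empirical_var = 0.04)
#> [1] 1600
```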
Depending on your input, we will open a pull request suggesting the functions soon.
Best,
Björn
Hi Björn,
This sounds quite reasonable to me.
> - implement helper functions `plan_*`, where `*` stands for performance measures such as bias or coverage as implemented in the SimDesign summary functions
Agreed, this is a nice convention to use in the package, and the respective functions (`bias()` and `plan_bias()`) could be linked to in the package documentation.
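For instance, the cross-links could be added via roxygen2 `@seealso` tags (a minimal sketch; the wording is illustrative):

```r
#' @seealso [plan_bias()] for planning the number of replications needed
#'   to estimate bias with a target MCSE
```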
> - let users either specify a 'worst-case' scenario (in case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation)
I like this idea, but I worry about the use of empirical estimates for the purpose of simulation planning. Since the empirical variance estimates are themselves a function of the replication size, one could quite easily over- or underestimate the requisite number of replications to obtain the desired precision, particularly if the initial replication size was too low. Ideally, some type of confidence interval should be included for this situation: either the complete vector of observations used to obtain the empirical variance estimates is passed to the function (so that internal uncertainty quantifiers can be computed, even if from the large-sample normal family or via bootstrapping), or the user supplies a standard error indicating the degree of precision in the empirical variances. I'd be fine with either; a sketch of the first option appears below.
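To sketch that first option (the function name and interface are hypothetical, and it assumes the pilot estimates are approximately normal so that a chi-squared interval for the variance applies):

```r
# Hypothetical sketch: pass the raw pilot estimates so the uncertainty in
# the empirical variance can be propagated into the planned replications.
plan_bias_ci <- function(target_mcse, estimates, level = 0.95) {
  n_pilot <- length(estimates)
  s2 <- var(estimates)
  alpha <- 1 - level
  # Chi-squared interval for the true variance, assuming
  # (n_pilot - 1) * s2 / sigma^2 ~ chi-squared(n_pilot - 1)
  var_bounds <- (n_pilot - 1) * s2 /
    qchisq(c(1 - alpha / 2, alpha / 2), df = n_pilot - 1)
  # Translate the point estimate and interval bounds into replication counts
  ceiling(c(estimate = s2, lower = var_bounds[1], upper = var_bounds[2]) /
            target_mcse^2)
}
```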
> - the `plan_*` functions can be used within the `Summarise()` function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.
It's unclear to me why this would be necessary within a `Summarise()` call. Since SimDesign stores the results information, one could just extract the analysis results and pass these to `plan_*` in raw form (see above), or reduce them manually as well. Basically, if this were to be constructed with `Summarise()` support, then utilizing the raw results data would be the most ideal path.
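For example, something along these lines, reusing the hypothetical `plan_bias_ci()` from above (`design`, `Generate`, `Analyse`, and `Summarise` are the usual SimDesign components, and the column name `estimate` is illustrative):

```r
library(SimDesign)

# Run a small pilot, retaining the raw Analyse() results
pilot <- runSimulation(design, replications = 100,
                       generate = Generate, analyse = Analyse,
                       summarise = Summarise, store_results = TRUE)

# Extract the stored per-replication results and plan from them directly
raw <- SimResults(pilot)
plan_bias_ci(target_mcse = 0.005, estimates = raw$estimate)
```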
> - this would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.
A vignette would be great! Though let me know your thoughts about my above points before proceeding. Thanks!
Hello Phil,
Thank you for your thoughts and willingness to include the idea in your package.
- Yes, linking the functions via the documentation is a good idea.
- That is a good point. We will include both a point estimate for the sample size and a lower/upper bound which is based on an uncertainty estimate for the empirical variance. The latter can then be obtained in the two ways that you mentioned. In this way, users will notice that the implied sample size may be imprecise if they only use a few replications for their pilot run.
- Sorry if we were unclear. Indeed, the `plan_*()` functions could be used both within and outside of a `Summarise()` call if users pass the raw results data. We just thought that using the functions within a `Summarise()` call might be a useful workflow for users incrementally building their simulation study, but it is not necessary (see the sketch below).
- Great, we suggest first creating a pull request for the functionality itself and then creating a vignette later on, once you have reviewed and possibly accepted our suggestions.
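As an illustration of that workflow, a `Summarise()` definition could report the planned replications alongside the usual summaries (a sketch; the column name `estimate` and the target MCSE are illustrative):

```r
# Hypothetical sketch: report the required replications per condition
# directly from within Summarise() during a pilot run
Summarise <- function(condition, results, fixed_objects) {
  c(bias = bias(results$estimate, parameter = 0),
    reps_needed = plan_bias(target_mcse = 0.005,
                            empirical_var = var(results$estimate)))
}
```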
We will prepare a pull request incorporating this functionality. Please bear with us, as this may take some time.