cmdstan
cmdstan copied to clipboard
allow list of excluded or included parameters for saving
Summary:
Match the RStan feature that lets you indicate which parameters to save by name (not by index within name).
I'm not sure if this is something that's best to filter down at the Stan level. @syclik?
Current Version:
v2.15.0
Hi, is there any progress? Without this, I cannot comfortably use transformed parameters and generated quantities block in a little complicated model... (I need to write all the big variables in model and generated quantities block twice.)
We've been getting asked about this a lot for CmdStanR (since it's available in RStan and a nice feature). Any hope for this in CmdStan so we could expose it in CmdStanR/Py?
Here's the most recent request for this but there have been many others:
https://discourse.mc-stan.org/t/cmdstan-sample-exclude-parameters/23141
I think this is going to interact really poorly with the standalone generated quantities. (My guess is that it'll cause segfaults... I've already seen it happen as-is.)
Before deciding to do this, maybe we should consider whether we're ok with breaking that existing feature?
@bob-carpenter, sorry! I just saw that I was tagged with this in the original post! I missed that notification. (I wonder what else I missed.)
I think this is going to interact really poorly with the standalone generated quantities. (My guess is that it'll cause segfaults... I've already seen it happen as-is.)
Ah good point, I guess this could result in variables needed for standalone gqs not being saved (is that the issue?).
On the other hand, I think it's reasonable to require that the user make sure to save all variables they need for running standalone gqs later.
I think standalone GQ would just error that a parameter is missing.
I think standalone GQ would just error that a parameter is missing.
Oh, that's great. In that case I think it's fine to add this feature! If standalone GQ has an informative error message this doesn't seem like a problem.
We've been getting asked about this a lot for CmdStanR (since it's available in RStan and a nice feature). Any hope for this in CmdStan so we could expose it in CmdStanR/Py?
The need to filter output variables is rarely needed in CmdStan firstly because output is flushed after each iteration and not buffered into memory and secondly because standard command line tools allow for the efficient slicing and dicing of the CSV before piping into other command line tools for analysis. Critically the CmdStan output design was based on this command line environment and not supporting bootstrapped interfaces that don’t necessarily have access to these same tools. In my opinion features like this are the responsibility of the bootstrapped interfaces using CmdStan in an non-intended way not CmdStan itself.
Overall I’m worried that more and more of the CmdStan development is motivated by supporting CmdStanR and CmdStanPy, for which it was never intended, and not actual command line use. Jonah and I talked about this a long time ago but what’s really needed here is an intermediate, programmatic C++ interface that can serve as the basis for interfaces like CmdStanR and CmdStanPy without affecting the command line user experience.
I know this is not the place for this discussion to continue, but let me start by saying that I fully agree that a fully programmatic interface for wrapper consumption would help a lot, we do have to realize that we have limited resources in terms of developer time and until we can make it work this way without compromising CmdStan or Stan services, I see no real issues here.
While a lot of issues, feature request and bug fixes have been added/fixed due to CmdStanR/py, I really do not have a feeling many if any features have been added specifically due to CmdStanR/Py. There certainly is more light shed on CmdStan with more exposure via wrappers, but the features added are all things that would have been done if CmdStan itself was more used (in my opinion at least, these are things I would want to have if using CmdStan directly). The only pull request that could potentially be seen this way is https://github.com/stan-dev/cmdstan/pull/1010 but even there I think the use case for CmdStan is there.
I think there are use cases for this particular issue as well. Just recently a colleague had to move a transformed parameter container to the model block to avoid wasting a ton of disk space. It certainly is not a very pressing need, which is also evidenced by the lack of action here, but I do not think there is anything particularly wrong if this feature was added.
I think standalone GQ would just error that a parameter is missing.
No. Currently, this will cause segfaults in a lot of conditions! ccing: @rok-cesnovar and @jgabry
what’s really needed here is an intermediate, programmatic C++ interface that can serve as the basis for interfaces like CmdStanR and CmdStanPy without affecting the command line user experience.
Yeah I do still agree with this! It would be great if we could find a developer to work on this, but I'm not sure who at the moment. Until then I think wrapping CmdStan like we're doing has been a big positive for the Stan community, and I don't actually think it's having a drastic affect on CmdStan development in a negative way (see next comment).
Overall I’m worried that more and more of the CmdStan development is motivated by supporting CmdStanR and CmdStanPy
As far as I know this is not actually the case, although I understand the concern. I'm not aware of any CmdStan development that wouldn't have also been good for CmdStan even if the wrappers didn't exist (e.g. we've uncovered some bugs that should have been fixed anyway and I don't recall any new features added that don't also benefit regular CmdStan users). I could be wrong, though. Do you have anything particular in mind? Looking through the closed CmdStan issues since the introduction of the wrappers I really don't see anything to suggest this is a problem, but I may have missed something.
No. Currently, this will cause segfaults in a lot of conditions! ccing: @rok-cesnovar and @jgabry
If that's the case then I guess if this were to be implemented we'd need to also update standalone gqs so it doesn't segfault
I forgot to also say that this issue was opened in 2017, so clearly not motivated by the introduction of the CmdStanX wrappers. I just happened to notice that it was still not implemented in CmdStan when I wanted to add it to the wrappers, but the feature was originally requested for CmdStan itself. (This is unrelated to whether it's worth implementing, just pointing out that it wasn't because of the wrappers.)
Hi. Is the implementation of this feature still not on the table? My friendship with Stan started from brms. I was using "default" backend RStan before I realized CmdStanR backed can be more efficient for big models. I quickly realized some limitations of brms, especially with latent variables or missing observations and big data samples. I'm trying to use CmdStanR directly for model definition and fitting to overcome these limitations. I'm sure that pro users will find an efficient way to write stan code from scratch for any problem at hand. But users like me are probably using pre-generated stan code from brms (or other wrappers) as rewriting everything is to difficult. Saving unnecessary parameters with big data sets can take a significant proportion of all processing time and is wasteful. I am using CmdStanR instead of RStan because of better efficiency and this missing feature is in my way :)
I don't think this will be implemented on brms nor on CmdStanR side so the only question is if it will be implemented on CmdStan side in any foreseeable future?
Saving unnecessary parameters with big data sets can take a significant proportion of all processing time and is wasteful.
actually, I/O isn't a big deal w/r/t processing time - it's the gradient calculations that dominate. but yes, it takes up disk space and makes downstream processing difficult.
I have a proposal open for language-level annotations on the code, one use case of which would be an @silent annotation on a parameter: https://github.com/stan-dev/design-docs/pull/45
actually, I/O isn't a big deal w/r/t processing time - it's the gradient calculations that dominate. but yes, it takes up disk space and makes downstream processing difficult.
I'm using variational inference/variation bayes as HMC would take months and Pathfinder is not operational yet. So writing millions of unnecessary parameters to CSV is significant compared to VB but of course not deadly.
Excluding certain variables from the saved output could be useful beyond interfacing with wrappers, e.g., for sharing code between models without polluting the output.
For concreteness, suppose we are considering a number of different models, say with different latent parameters for age, education, age-education interaction, state or residence, etc. To ensure these different models are evaluated consistently, I often add the following two blocks (the second should be the same in each Stan file unless we change the observation model).
transformed parameters {
// Predictor for regression or `logits` for classification.
vector [n] y_hat = ...;
}
generated quantities {
// Log likelihood for each observation, e.g., for WAIC.
vector [n] log_lik;
for (i in 1:n) {
log_lik[i] = ...;
}
}
Often, I am not concerned with y_hat, but I care about log_lik and would like to exclude the former from the output. E.g., suppose we are fitting the classic CBS News poll dataset for the 1988 presidential election with ~11.5k observations from Gelman and Hill's book on multilevel regression (cf. example implementation). The number of parameters is relatively small (about 100 accounting for different states and various interactions), and the output is dominated by y_hat and log_lik amounting to ~23k values. The additional disk and memory use is a bit cumbersome, as already mentioned by @mitzimorris. We also suffer performance issues when calling stansummary or diagnose. In the CBS News example, it takes about 1 mins 30 secs to draw samples on my machine but 4 mins 10 secs to call stansummary and diagnose.
I have considered the following options:
- Move the
y_hatevaluation to thegenerated quantitiesand wrap it in{}to remove it from the output. Copy-paste thegenerated quantitiesblock between different models, only changey_hat, and ensurelog_likis the same across all models. - Move the
log_likevaluation into a separate file and#includeit in thegenerated quantitiesblock. But that requiresy_hatto be in scope which, in turn, means it's written to disk.
Being able to exclude y_hat from the output would allow the log_lik evaluation to be refactored and moved to a separate file without doubling the output size. In the example above, we'd be able to reduce the runtime from 5 mins 40 secs to about 3 mins 35 secs.
Thanks for the detailed example @tillahoffmann. Yeah I still think this would be a good feature for CmdStan (or potentially more generally using @WardBrian ’s annotations idea)
I intend to take a pass at this via annotations sometime soon