cmdstan allow list of excluded or included parameters for saving

Summary:

Match the RStan feature that lets you indicate which parameters to save by name (not by index within name).

I'm not sure if this is something that's best to filter down at the Stan level. @syclik?

Current Version:

v2.15.0

May 15 '17 19:05 bob-carpenter

Hi, is there any progress? Without this, I cannot comfortably use transformed parameters and generated quantities block in a little complicated model... (I need to write all the big variables in model and generated quantities block twice.)

Aug 29 '18 00:08 MatsuuraKentaro

We've been getting asked about this a lot for CmdStanR (since it's available in RStan and a nice feature). Any hope for this in CmdStan so we could expose it in CmdStanR/Py?

Here's the most recent request for this but there have been many others:

https://discourse.mc-stan.org/t/cmdstan-sample-exclude-parameters/23141

Jun 23 '21 22:06 jgabry

I think this is going to interact really poorly with the standalone generated quantities. (My guess is that it'll cause segfaults... I've already seen it happen as-is.)

Before deciding to do this, maybe we should consider whether we're ok with breaking that existing feature?

Jun 24 '21 02:06 syclik

@bob-carpenter, sorry! I just saw that I was tagged with this in the original post! I missed that notification. (I wonder what else I missed.)

Jun 24 '21 13:06 syclik

I think this is going to interact really poorly with the standalone generated quantities. (My guess is that it'll cause segfaults... I've already seen it happen as-is.)

Ah good point, I guess this could result in variables needed for standalone gqs not being saved (is that the issue?).

On the other hand, I think it's reasonable to require that the user make sure to save all variables they need for running standalone gqs later.

Jun 24 '21 16:06 jgabry

I think standalone GQ would just error that a parameter is missing.

Jun 24 '21 16:06 rok-cesnovar

I think standalone GQ would just error that a parameter is missing.

Oh, that's great. In that case I think it's fine to add this feature! If standalone GQ has an informative error message this doesn't seem like a problem.

Jun 24 '21 16:06 jgabry

We've been getting asked about this a lot for CmdStanR (since it's available in RStan and a nice feature). Any hope for this in CmdStan so we could expose it in CmdStanR/Py?

The need to filter output variables is rarely needed in CmdStan firstly because output is flushed after each iteration and not buffered into memory and secondly because standard command line tools allow for the efficient slicing and dicing of the CSV before piping into other command line tools for analysis. Critically the CmdStan output design was based on this command line environment and not supporting bootstrapped interfaces that don’t necessarily have access to these same tools. In my opinion features like this are the responsibility of the bootstrapped interfaces using CmdStan in an non-intended way not CmdStan itself.

Overall I’m worried that more and more of the CmdStan development is motivated by supporting CmdStanR and CmdStanPy, for which it was never intended, and not actual command line use. Jonah and I talked about this a long time ago but what’s really needed here is an intermediate, programmatic C++ interface that can serve as the basis for interfaces like CmdStanR and CmdStanPy without affecting the command line user experience.

Jun 28 '21 15:06 betanalpha

I know this is not the place for this discussion to continue, but let me start by saying that I fully agree that a fully programmatic interface for wrapper consumption would help a lot, we do have to realize that we have limited resources in terms of developer time and until we can make it work this way without compromising CmdStan or Stan services, I see no real issues here.

While a lot of issues, feature request and bug fixes have been added/fixed due to CmdStanR/py, I really do not have a feeling many if any features have been added specifically due to CmdStanR/Py. There certainly is more light shed on CmdStan with more exposure via wrappers, but the features added are all things that would have been done if CmdStan itself was more used (in my opinion at least, these are things I would want to have if using CmdStan directly). The only pull request that could potentially be seen this way is https://github.com/stan-dev/cmdstan/pull/1010 but even there I think the use case for CmdStan is there.

I think there are use cases for this particular issue as well. Just recently a colleague had to move a transformed parameter container to the model block to avoid wasting a ton of disk space. It certainly is not a very pressing need, which is also evidenced by the lack of action here, but I do not think there is anything particularly wrong if this feature was added.

Jun 28 '21 17:06 rok-cesnovar

I think standalone GQ would just error that a parameter is missing.

No. Currently, this will cause segfaults in a lot of conditions! ccing: @rok-cesnovar and @jgabry

Jun 28 '21 20:06 syclik

what’s really needed here is an intermediate, programmatic C++ interface that can serve as the basis for interfaces like CmdStanR and CmdStanPy without affecting the command line user experience.

Yeah I do still agree with this! It would be great if we could find a developer to work on this, but I'm not sure who at the moment. Until then I think wrapping CmdStan like we're doing has been a big positive for the Stan community, and I don't actually think it's having a drastic affect on CmdStan development in a negative way (see next comment).

Overall I’m worried that more and more of the CmdStan development is motivated by supporting CmdStanR and CmdStanPy

As far as I know this is not actually the case, although I understand the concern. I'm not aware of any CmdStan development that wouldn't have also been good for CmdStan even if the wrappers didn't exist (e.g. we've uncovered some bugs that should have been fixed anyway and I don't recall any new features added that don't also benefit regular CmdStan users). I could be wrong, though. Do you have anything particular in mind? Looking through the closed CmdStan issues since the introduction of the wrappers I really don't see anything to suggest this is a problem, but I may have missed something.

No. Currently, this will cause segfaults in a lot of conditions! ccing: @rok-cesnovar and @jgabry

If that's the case then I guess if this were to be implemented we'd need to also update standalone gqs so it doesn't segfault

Jun 30 '21 22:06 jgabry

I forgot to also say that this issue was opened in 2017, so clearly not motivated by the introduction of the CmdStanX wrappers. I just happened to notice that it was still not implemented in CmdStan when I wanted to add it to the wrappers, but the feature was originally requested for CmdStan itself. (This is unrelated to whether it's worth implementing, just pointing out that it wasn't because of the wrappers.)

Jun 30 '21 22:06 jgabry

Hi. Is the implementation of this feature still not on the table? My friendship with Stan started from brms. I was using "default" backend RStan before I realized CmdStanR backed can be more efficient for big models. I quickly realized some limitations of brms, especially with latent variables or missing observations and big data samples. I'm trying to use CmdStanR directly for model definition and fitting to overcome these limitations. I'm sure that pro users will find an efficient way to write stan code from scratch for any problem at hand. But users like me are probably using pre-generated stan code from brms (or other wrappers) as rewriting everything is to difficult. Saving unnecessary parameters with big data sets can take a significant proportion of all processing time and is wasteful. I am using CmdStanR instead of RStan because of better efficiency and this missing feature is in my way :)

I don't think this will be implemented on brms nor on CmdStanR side so the only question is if it will be implemented on CmdStan side in any foreseeable future?

Mar 29 '22 13:03 mantas-t

Saving unnecessary parameters with big data sets can take a significant proportion of all processing time and is wasteful.

actually, I/O isn't a big deal w/r/t processing time - it's the gradient calculations that dominate. but yes, it takes up disk space and makes downstream processing difficult.

Mar 29 '22 13:03 mitzimorris

I have a proposal open for language-level annotations on the code, one use case of which would be an @silent annotation on a parameter: https://github.com/stan-dev/design-docs/pull/45

Mar 29 '22 13:03 WardBrian

actually, I/O isn't a big deal w/r/t processing time - it's the gradient calculations that dominate. but yes, it takes up disk space and makes downstream processing difficult.

I'm using variational inference/variation bayes as HMC would take months and Pathfinder is not operational yet. So writing millions of unnecessary parameters to CSV is significant compared to VB but of course not deadly.

Mar 29 '22 13:03 mantas-t

Excluding certain variables from the saved output could be useful beyond interfacing with wrappers, e.g., for sharing code between models without polluting the output.

For concreteness, suppose we are considering a number of different models, say with different latent parameters for age, education, age-education interaction, state or residence, etc. To ensure these different models are evaluated consistently, I often add the following two blocks (the second should be the same in each Stan file unless we change the observation model).

transformed parameters {
    // Predictor for regression or `logits` for classification.
    vector [n] y_hat = ...;
}

generated quantities {
    // Log likelihood for each observation, e.g., for WAIC.
    vector [n] log_lik;
    for (i in 1:n) {
        log_lik[i] = ...;
    }
}

Often, I am not concerned with y_hat, but I care about log_lik and would like to exclude the former from the output. E.g., suppose we are fitting the classic CBS News poll dataset for the 1988 presidential election with ~11.5k observations from Gelman and Hill's book on multilevel regression (cf. example implementation). The number of parameters is relatively small (about 100 accounting for different states and various interactions), and the output is dominated by y_hat and log_lik amounting to ~23k values. The additional disk and memory use is a bit cumbersome, as already mentioned by @mitzimorris. We also suffer performance issues when calling stansummary or diagnose. In the CBS News example, it takes about 1 mins 30 secs to draw samples on my machine but 4 mins 10 secs to call stansummary and diagnose.

I have considered the following options:

Move the y_hat evaluation to the generated quantities and wrap it in {} to remove it from the output. Copy-paste the generated quantities block between different models, only change y_hat, and ensure log_lik is the same across all models.
Move the log_lik evaluation into a separate file and #include it in the generated quantities block. But that requires y_hat to be in scope which, in turn, means it's written to disk.

Being able to exclude y_hat from the output would allow the log_lik evaluation to be refactored and moved to a separate file without doubling the output size. In the example above, we'd be able to reduce the runtime from 5 mins 40 secs to about 3 mins 35 secs.

Aug 30 '23 04:08 tillahoffmann

Thanks for the detailed example @tillahoffmann. Yeah I still think this would be a good feature for CmdStan (or potentially more generally using @WardBrian ’s annotations idea)

Sep 08 '23 16:09 jgabry

I intend to take a pass at this via annotations sometime soon

Sep 08 '23 16:09 WardBrian

cmdstan cmdstan copied to clipboard

allow list of excluded or included parameters for saving

Summary:

Current Version:

cmdstan
cmdstan copied to clipboard