cmdstanr Consider more efficient serialization of CmdStanFit objects

Is your feature request related to a problem? Please describe.

Posterior samples take up a lot of space, and RDS scales poorly. Without compression, the files get pretty big. With gzfile compression, serialization takes a lot of time.

Describe the solution you'd like

I propose a new qs-based save method analogous to the current $save_object().

Describe alternatives you've considered

If all the data in a CmdStanFit object can be expressed as a collection of data frames, maybe fst and arrow could be alternatives, but that would require multiple files per object and thus more bookkeeping.

Nov 08 '20 22:11 wlandau

what proportion of processing time is spent in writing a StanFit object to RDS and what proportion of time is spent obtaining a sample from the CmdStan CSV files?

Nov 09 '20 00:11 mitzimorris

I take your point that HMC is usually far more time-consuming than saving and loading samples. But downstream postprocessing tends to be quick, and I tend to iterate over it more, which means time spent reading large collections of draws is noticeable.

On reflection, if depending on qs is burdensome, what about a single method just to load all the relevant output into memory so the user can choose how to save it?

https://github.com/stan-dev/cmdstanr/blob/d27994f804c493ff3047a2a98d693fa90b83af98/R/fit.R#L16-L18

Nov 09 '20 13:11 wlandau

I may be wrong, but I believe @mitzimorris wasn't talking about HMC sampling times, but the fact that at present cmdstanr has to read the samples from the CSV files that cmdstan generates, and it may be that reading time (soon to be faster) that is the primary bottleneck, at least when first accessing the data. Even so, I also think @wlandau's point is that there are use cases for re-loading the data across sessions that would be hastened by use of qs rather than rds.

Nov 09 '20 13:11 mike-lawrence

I may be wrong, but I believe @mitzimorris wasn't talking about HMC sampling times, but the fact that at present cmdstanr has to read the samples from the CSV files that cmdstan generates, and it may be that reading time (soon to be faster) that is the primary bottleneck, at least when first accessing the data.

Yes, my mistake. But CSV file processing is upstream of saving the object, so I still stand by my more general point.

Nov 09 '20 13:11 wlandau

(Sorry @wlandau, yes and I edited my response here after realizing my correction was indeed tangential to your point; I now see how the variability in whether folks view issues via email [where they don't see edits] vs on github is a reason for not using the edit feature here)

Nov 09 '20 13:11 mike-lawrence

Thanks for opening the issue @wlandau. I have no particular attachment to the RDS format, so I'm definitely open to better options.

On reflection, if depending on qs is burdensome, what about a single method just to load all the relevant output into memory so the user can choose how to save it?

I like the idea of letting the user choose how to save things and if we can do that without adding an additional dependency then that would be great. So this seems like a good approach. What do others think?

Nov 10 '20 17:11 jgabry

I agree that we need good serialization - but we need to think cross-interface and therefore find serialization formats that can be accessed by Python and C++ as well as R-specific formats.

Nov 10 '20 17:11 mitzimorris

Yeah I definitely like the idea of using formats that are compatible across the interfaces. I don't think that conflicts with adding a method to reading everything into memory though. Right now the user has to call multiple different methods to get everything into memory (draws(), sampler_diagnostics(), etc.), which is fine for interactive use but a pain if they just want to read everything in to save it. So a single method that reads everything into memory seems like a good idea to me. @mitzimorris Won't we need that anyway regardless of which serialization format is used? Or am I misunderstanding (totally possible!).

Nov 10 '20 18:11 jgabry

how are sampler_diagnostics() different from draws()? is this implemented differently in CmdStanR than in CmdStanPy?

aren't the first 7 columns of the draws array the source for sampler_diagnostics? or is this something extra computed by CmdStanR outside of CmdStan? (feature creep!)

the fundamental problem we're working against now is the way that the underlying Stan C++ code for the algos and services send all output information to a single CSV file. for backwards compatibility, we work with what we've got. at the same time, we should remember to keep a wholistic view of the workflow and redesign the I/O accordingly.

Nov 10 '20 18:11 mitzimorris

Yeah the sampler diagnostics are the first columns of the same CSV file containing the samples, but because those columns serve a different purpose than the other columns the draws() method doesn't return those columns, just the sampler_diagnostics() method. The draws() method returns the parameters, transformed parameters, and generated quantities, and the sampler_diagnositcs() method returns the diagnostic columns. Here's the original issue with code examples:

https://github.com/stan-dev/cmdstanr/issues/123

Is that not what CmdStanPy does?

Nov 10 '20 18:11 jgabry

in CmdStanPy, everything is stored in a single numpy.ndarray structure

stored column major, so all the sampler diags are contiguous in memory
views on the array will give you sampler diags or draws, (although this isn't implemented)

initial focus for CmdStanPy was truly lightweight wrapper - it was important to get info from CSV files into a data structure that had the right memory locality properties for downstream analysis. this could be easily modified to have sampler_params and model_params (or sampler_diagnostics and draws, for uniformity) - there's a few open issues already - could be done as part of https://github.com/stan-dev/cmdstanpy/issues/171

Nov 10 '20 20:11 mitzimorris

Oh I see what you mean. We're currently storing the diagnostics in one array and the other draws in a another array (once they're read in from CSV), but we could switch to storing them together if that's preferable (the method would continue to return them separately). I didn't think it would matter too much but I could definitely be wrong!

Also @rok-cesnovar any thoughts on any of this?

Nov 10 '20 20:11 jgabry

my concern is making sure interfaces aren't too different from one another - we must allow idioms appropriate to the language (i.e., "pythonic" vs "rthonic" - use of quotes quite intentional)

Nov 10 '20 20:11 mitzimorris

Yeah that makes sense, I agree we should keep them similar unless there's a good reason.

Nov 10 '20 20:11 jgabry

No strong feelings from me either way.

I am not sure many users use the sampler diagnostics apart from running the cmdstan(r) internal checks on them, but yeah, if we want to be consistent with cmdstanpy, we can return them as part of draws(). I would keep the separate $sampler_diagnostics(), that is useful also internally.

Nov 10 '20 21:11 rok-cesnovar

Oh I didn't mean we should return them as part of draws() -- I definitely think it's preferable to have draws() just return the params and generated quantities. Sorry for being unclear. We should definitely keep the user-facing behavior the same as it is now. I was referring just to how we're storing them internally.

Nov 10 '20 21:11 jgabry

for CmdStanPy, there's Arviz - https://arviz-devs.github.io/arviz/getting_started/InferenceDataCookbook.html#cookbook

the underlying storage mechanism doesn't have to be the same - the APIs should have the same features, and as far as serialization goes - what else needs to be in a serialized object? (cf above-referenced issue)

Nov 10 '20 21:11 mitzimorris

I doubt that CmdStanPy users are interested in the sampler params, so changing CmdStanPy so that draws() returns a view which omits sampler diagnostics would work.

Nov 10 '20 21:11 mitzimorris

what else needs to be in a serialized object

I might be missing something but here's what occurs to me:

param/GQ draws
sampler diagnostics
inits (if user specified, we currently don't have a way of getting them if Stan generates them unfortunately: https://github.com/stan-dev/stan/issues/2950)
mass matrix info?

Nov 10 '20 21:11 jgabry

what is the purpose of the serialized object?

for reproducibility, you want model, data, inits, seed, chain-ids, and run config info

for saved sampler state, you want stepsize and mass matrix, but RNG state is fundamentally missing

for posterior predictions, you need parameter draws plus data

Nov 10 '20 22:11 mitzimorris

Yeah good point. I think, like you said, there are various potential purposes so we might need to consider multiple options that correspond to these different purposes.

Nov 10 '20 22:11 jgabry

what is the purpose of the serialized object?

In my case, it's a part of a targets pipeline, especially with stantargets. targets is like Make for data analysis in R, it caches computations to prevent them from running unnecessarily. For example, tar_stan_mcmc() defines an upstream target with the CmdStanFit object and downstream targets with information like draws. For certain kinds of changes the user makes to the pipeline, targets figures out that the draws need to be refreshed but the model does not need to run again.

Aug 07 '22 17:08 wlandau

Coming back to this after a long time because save_object is still pretty slow (e.g. https://discourse.mc-stan.org/t/cmdstanr-save-object-takes-a-long-time/). @wlandau Are you still interested in more efficient serialization? If so, any interest in working on it at some point and making a PR?

Aug 02 '23 17:08 jgabry

Thanks for following up, @jgabry. Looking back, the feedback from you and @mitzimorris helps me understand how serialization would need to fit with the rest of the Stan ecosystem.

I like your point at https://github.com/stan-dev/cmdstanr/issues/340#issuecomment-724869645 about loading things into memory without saving to disk (i.e. to marshal but not serialize). This separation seems appealing. qs is great, but other packages might supersede it at some point.

Thinking out loud about the options I like:

marshal_object() method which is basically save_object() without saving to RDS. As part of this option, save_object() could simplify even further: call marshal_object() and then saveRDS().
A flag in save_object() to marshal only (skip saveRDS()) so the user can manually save the object.
Just document how to selectively load MCMC output into memory and efficient ways to save the object. @mitzimorris raises a great point in https://github.com/stan-dev/cmdstanr/issues/340#issuecomment-724994708 that different users will need different kinds of output. This could really matter in cases like https://discourse.mc-stan.org/t/cmdstanr-save-object-takes-a-long-time/32295/2, i.e. if the user can drop superfluous output.

Aug 07 '23 02:08 wlandau

thanks for thinking more about this!

All those seem like reasonable options. The last one seems like the simplest and something we should do regardless of whether we do either of the other options.

Between the first two options I’d slightly favor the first, only because calling a method named save_object to not actually save the object is a bit weird, so maybe a separate method would be better.

Aug 08 '23 20:08 jgabry

Worth considering here that this issue would be largely obviated by the longer-term project of cmdstan IO rewrite?

Aug 09 '23 14:08 mike-lawrence

Yeah quite possibly, although who knows when that will happen!

Aug 09 '23 16:08 jgabry

Sounds like documentation is a reasonable place to start. Okay to open a PR to edit https://mc-stan.org/cmdstanr/articles/cmdstanr.html#saving-fitted-model-objects?

Aug 09 '23 16:08 wlandau

That would be great, thank you!

Aug 09 '23 17:08 jgabry

Given how people might want to save only pieces of the output, I am not sure how appealing a marshal_object() method seems now. But I think manually marshaling the object would be easier if methods like init() and profiles() did not require a try() wrapper.

Aug 10 '23 00:08 wlandau

cmdstanr cmdstanr copied to clipboard

Consider more efficient serialization of CmdStanFit objects

cmdstanr
cmdstanr copied to clipboard