convert CmdStan CSV output to R dump format input

Open bob-carpenter opened this issue 8 years ago • 12 comments

Moved from https://github.com/stan-dev/stan/issues/544

In order to perform fake data simulation or posterior predictive checking, it would be nice to be able to convert the output of a Stan model from CSV format to the input for a Stan model in R dump format.

This should be structured as a command parallel to bin/print that does the conversion of an output CSV file. An alternative would be to have a model call argument that would produce R dump output.

The manual for CmdStan needs to be updated to show how to use this function. This will enable us to write a chapter in the manual on fake data and posterior predictive checks.

Be careful about type of the columns --- if there are integer generated quantities, the output can be integers.
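A minimal Python sketch of the conversion described above, not an existing CmdStan tool. It assumes the usual CmdStan CSV layout: `#` comment lines, one header row, indexed columns such as `y.1` or `x.2.3` written in column-major order, and sampler columns ending in `__`. The function names and the `int_vars` argument are hypothetical. Integer-valued variables have to be named by the caller, because a real that happens to equal 1 cannot be told apart from an integer 1 in the CSV.

```python
import csv
import io


def read_draws(text):
    """Return (header, rows) from CmdStan CSV text, skipping '#' comment lines."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    reader = csv.reader(io.StringIO("\n".join(lines)))
    header = next(reader)
    return header, list(reader)


def group_columns(header, row):
    """Group 'y.1', 'y.2', ... under 'y', keeping the CSV (column-major) order."""
    grouped = {}
    for name, value in zip(header, row):
        if name.endswith("__"):
            continue  # sampler columns such as lp__ are not model variables
        base, *idx = name.split(".")
        entry = grouped.setdefault(base, {"dims": [0] * len(idx), "values": []})
        # The largest index seen in each position gives the size of that dimension.
        entry["dims"] = [max(d, int(i)) for d, i in zip(entry["dims"], idx)]
        entry["values"].append(value)
    return grouped


def to_rdump(grouped, int_vars=()):
    """Render grouped variables as R dump text; int_vars names integer variables."""
    out = []
    for name, entry in grouped.items():
        cast = int if name in int_vars else float
        body = ", ".join(str(cast(float(v))) for v in entry["values"])
        dims = entry["dims"]
        if not dims:
            out.append(f"{name} <- {body}")
        elif len(dims) == 1:
            out.append(f"{name} <- c({body})")
        else:
            # R's structure(..., .Dim = ...) is column-major, like the CSV columns.
            shape = ", ".join(str(d) for d in dims)
            out.append(f"{name} <- structure(c({body}), .Dim = c({shape}))")
    return "\n".join(out)
```

For example, `to_rdump(group_columns(header, rows[0]), int_vars={"y"})` renders the first draw, with `y` written as integers and everything else as reals.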

For example, for the Bernoulli model in the introduction, a fake-data generator should look like:

data {
  int<lower=0> N;
  real<lower=0, upper=1> theta;
}
generated quantities {
  int<lower=0,upper=1> y[N];

  for (n in 1:N)
    y[n] <- bernoulli_rng(theta);
}

Related issues:

  • To run this, we need both the output of running the Bernoulli model and a value for N in order to provide input for this model
  • Doing proper posterior model generation will require empty parameters and model blocks, so update the parser so that this works (or link to a different issue); @betanalpha is working on a feature for this with a dummy sampler that can handle empty parameter vectors

bob-carpenter avatar Nov 24 '16 05:11 bob-carpenter

I wrote some Python (from scratch; it's not PyStan) to do the conversion, mainly to automate simulation + sampling for the same model when using CmdStan. It could be a useful starting point. You wouldn't want me to port it to C++ myself, but it only uses NumPy, so it should be quite portable.

maedoc avatar Oct 17 '17 19:10 maedoc

Thanks, @maedoc.

Now that you mention it, RStan must have all the pieces of this implemented because of the way extract() and stan_rdump() work.

bob-carpenter avatar Oct 17 '17 21:10 bob-carpenter

Sure but it's all in R code.

sakrejda avatar Oct 17 '17 21:10 sakrejda

Is this still relevant? Is a converter from the cmdstan CSV output to R dump still needed? My guess would be no.

rok-cesnovar avatar Sep 17 '19 19:09 rok-cesnovar

Something like this is needed for restarts, but I think that'd require a new command.

bob-carpenter avatar Sep 17 '19 21:09 bob-carpenter

Is this still relevant? Is a converter from the cmdstan CSV output to R dump still needed? My guess would be no.

A few use cases I've wanted this for:

  • restarts, e.g. when a model takes longer than the walltime limit on a cluster,
  • simulating data and then fitting a model to that data,
  • multiple-model workflows,
  • initializing HMC from an optimization.

I usually end up with a mess of grep, cut, tr, and nl in bash for what is a pretty simple job. The two main modes would be:

  • take 1 line of sampling CSV, convert to R/JSON format
  • take summary CSV, convert to R/JSON format
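The first mode could look roughly like the Python sketch below, which turns one draw into JSON that could be passed back to CmdStan as data or inits. It is only an illustration: `draw_to_json` and its `draw`, `extra`, and `int_vars` arguments are hypothetical names, and it assumes `#` comment lines, a single header row, 1-based `name.i.j` column names, and sampler columns ending in `__`. The `extra` argument covers values that are not in the CSV at all, such as `N` for a fake-data generator.

```python
import csv
import io
import json


def draw_to_json(csv_text, draw=-1, extra=None, int_vars=()):
    """Convert one draw (default: the last) of CmdStan CSV text to a JSON string."""
    lines = [ln for ln in csv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    rows = list(csv.reader(io.StringIO("\n".join(lines))))
    header, draws = rows[0], rows[1:]
    result = dict(extra or {})  # values supplied by the caller, e.g. {"N": 10}
    for name, raw in zip(header, draws[draw]):
        if name.endswith("__"):
            continue  # skip lp__, stepsize__, and other sampler columns
        base, *idx = name.split(".")
        value = int(float(raw)) if base in int_vars else float(raw)
        if not idx:
            result[base] = value
            continue
        # Walk (and grow) nested lists so that x.2.3 lands at result["x"][1][2].
        node = result.setdefault(base, [])
        for depth, i in enumerate(int(s) - 1 for s in idx):
            last = depth == len(idx) - 1
            while len(node) <= i:
                node.append(None if last else [])
            if last:
                node[i] = value
            else:
                node = node[i]
    return json.dumps(result, indent=2)
```

Taking the last draw by default suits the restart case; for simulation, something like `draw_to_json(text, draw=0, extra={"N": 10})` would combine a sampled `theta` with a chosen `N`.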

It'd also be useful to massage the CSV to convert matrices from x.1.2 style columns to 2D ASCII matrices for use with gnuplot or similar, but that's fairly outside scope.
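For what it's worth, the matrix reshaping is short once the column names are parsed. The Python sketch below assumes 1-based `name.i.j` columns; the function name `matrix_to_ascii` is made up for illustration. It lays one draw of a 2D variable out as whitespace-separated text rows.

```python
import re


def matrix_to_ascii(header, row, name):
    """Lay out the name.i.j columns of one CSV row as whitespace-separated text rows."""
    pattern = re.compile(rf"^{re.escape(name)}\.(\d+)\.(\d+)$")
    cells = {}
    for col, value in zip(header, row):
        m = pattern.match(col)
        if m:
            cells[(int(m.group(1)), int(m.group(2)))] = value
    if not cells:
        raise KeyError(f"no 2-D columns found for {name!r}")
    n_rows = max(i for i, _ in cells)
    n_cols = max(j for _, j in cells)
    # Emit row by row, regardless of the order the columns appear in the CSV.
    return "\n".join(
        " ".join(cells[(i, j)] for j in range(1, n_cols + 1))
        for i in range(1, n_rows + 1)
    )
```

Looking cells up by their (i, j) index rather than by position means the column-major order of the CSV does not matter here.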

Is input/output in JSON now part of CmdStan? That seems like the easiest way to go. I could give it a go, since it'd be miles better than the bash equivalent.

maedoc avatar Dec 13 '19 08:12 maedoc

Once https://github.com/stan-dev/cmdstanr/pull/95 is merged into cmdstanr, you will be able to read the samples and all sampler parameters (divergent, leapfrog, etc.) with read_sample_csv(filenames) in R. It outputs the following list:

list(
    sampling_info 
    inverse_mass_matrix 
    warmup 
    post_warmup 
    warmup_sampler 
    post_warmup_sampler
  )

I think this is close to what you are looking for. You can read existing CmdStan CSV files; there is no need to run the model through cmdstanr if you don't want to.

If you feel more at home with Python then try check_sampler_csv from cmdstanpy. I think it does something similar.

rok-cesnovar avatar Dec 13 '19 09:12 rok-cesnovar

Input in JSON has been a part of CmdStan for quite some time; we just made the input a bit faster for the last release. The output is still CSV only, however.

rok-cesnovar avatar Dec 13 '19 09:12 rok-cesnovar

I'm aware of the R/Py interfaces to CmdStan as well. I was hoping to stick with a plain Bash/Makefile setup, but I think that's just not realistic for complex workflows. Munging data formats on the command line is precarious, especially for matrix/array datatypes.

maedoc avatar Dec 13 '19 09:12 maedoc

On Dec 13, 2019, at 4:39 AM, marmaduke woodman wrote:

I'm aware of the R/Py interfaces to CmdStan as well,

In case it wasn't clear to our devs not involved in CmdStanPy, the original version was derived from Marmaduke's PyCmdStan package.

bob-carpenter avatar Dec 13 '19 15:12 bob-carpenter

Oh haha :) Now I feel like a fool :blush:

rok-cesnovar avatar Dec 13 '19 15:12 rok-cesnovar

Don't feel bad---it's a big project with too much going on for any one person to follow. I'm just trying to close the loops where I see an opportunity.

On Dec 13, 2019, at 10:32 AM, Rok Češnovar wrote:

Oh haha :) Now I feel like a fool.

bob-carpenter avatar Dec 13 '19 15:12 bob-carpenter