statastan
Having a separate Stata program to export data in R dump format
I feel it would be great to have a separate Stata program that exports in-memory data in R dump format. Users could then choose to run CmdStan either directly via the Stata -shell- command or via the StataStan module.
Also, an alternative to having Stata write the data with -file write- is to import it into R with the haven package and then convert it with rstan::stan_rdump. This adds dependencies to StataStan, but could potentially make StataStan itself even more lightweight.
For example, I have the following R script in my working directory, which reads in rdump.dta and exports rdump.data.R.
library(haven)
library(rstan)
# read .dta
rdump <- read_stata("rdump.dta")
# turn the data frame into separate vector columns
for (i in seq_along(rdump)) {
  varname <- names(rdump)[i]
  assign(varname, as.vector(rdump[[i]]))
}
# write .data.R: the number of observations first, then every column
N <- nrow(rdump)
stan_rdump("N", "rdump.data.R")
stan_rdump(names(rdump), "rdump.data.R", append = TRUE)
The hardcoded file names are not that annoying actually. An alternative is to have Stata write the R script with a macro for the filename but that's too much work. Easier would be to just save an extra copy of the in-memory data and name it accordingly in Stata. Then all the conversion can be done in a do file.
use schools.dta, clear
save rdump.dta, replace
shell Rscript rdump.R
* one can then run CmdStan via the -shell- command
shell make stan_schools_model
shell ./stan_schools_model sample data file=rdump.data.R
You're asking for a three-way Stata-Stan-R interface, which frankly is just not going to happen here, however sympathetic I am to the cause of nudging Stata people towards using other software. But if you fork statastan and do it, I will advertise it (as Peter Tosh said).
I'm suggesting it might give StataStan users more flexibility to have a dta_to_dump.ado (that only exports the data in the dump format) in addition to stan.ado (that also runs the model and loads the Stan output in one go). As an example, one could then export a potentially large .dta to the dump format only once and then run different Stan models on it, instead of exporting the same data again every time the model changes.
The use of stan_rdump to create dump files is actually mentioned in the User's Guide to CmdStan, in C.1. Creating Dump Files:
Dump files can be created from R using RStan. The function is stan_rdump in package rstan. Using R’s native dump() function can produce dump files which Stan cannot read in. The underlying cause is that R gets creative in the format it uses for output, only being constrained to something that can be executed in R. So it will write the array containing the values 1, 2, 3, 4 as 1:4 rather than as c(1,2,3,4).
The dump format just doesn't seem very friendly to deal with, and knowing that stan_rdump can take care of that, it seems natural to try to get the .dta file into R and then use it. I included the example to show just what it'd be like to do the conversion with stan_rdump. It seems that for you the costs of the dependencies clearly outweigh the benefits. I wonder why?
Yes, having something write the data dump format would be a good idea. You don't need a dependency on R to do that.
The dump format was a colossally bad decision on my part in the beginning. I thought R's dump format was simpler than it was, but R's crazy (as I mentioned in not quite so many words in the manual bit you quote). Our dump format is well defined and easy to generate, but because it's only a subset of what R writes by default, we had to write the special dump function in RStan.
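As a rough illustration of how little is needed to generate the subset of the format discussed here, below is a minimal Python sketch (not code from StataStan or RStan). The function names format_value, dump_line and write_dump are hypothetical, and the sketch assumes only numeric scalars and non-empty flat vectors with no missing values. The point is that a vector is always written out in full as c(...), never abbreviated to 1:4.

```python
def format_value(x):
    """Format one number; repr() keeps full floating-point precision."""
    if isinstance(x, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(x, int):
        return str(x)
    return repr(float(x))


def dump_line(name, value):
    """Return one dump-format assignment for a scalar or a flat list."""
    if isinstance(value, (list, tuple)):
        # Always spell the vector out as c(...), never as a range like 1:4.
        body = ", ".join(format_value(v) for v in value)
        return f"{name} <- c({body})"
    return f"{name} <- {format_value(value)}"


def write_dump(variables, path):
    """Write a dict of named scalars/vectors to path, one assignment per line."""
    with open(path, "w") as fh:
        for name, value in variables.items():
            fh.write(dump_line(name, value) + "\n")
```

For instance, dump_line("y", [1, 2, 3, 4]) gives the line y <- c(1, 2, 3, 4). Missing values, NaN and infinities would need extra handling that is left out here.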
We're going to be replacing it over time, we just haven't decided with what (we'll maintain backward compatibility for the old format).
- Bob
Is this format readable back into CmdStan or Stata? If neither, my opinion stays the same. But if it is, then we'll add it to the list.
PS you know the chains go into a CSV file right? Do you want other things like the CmdStan parameters?
Yes, it's the format used by CmdStan for data and parameter inits. So you must already have something to write it out to file if you allow the data to come in from Stata.
- Bob
The Feather format based on Apache Arrow looks promising as another shared file format across the Stan interfaces. Very fast read/write and the Arrow developers surely want to make it the de-facto standard for columnar data processing.
Implementations in Python, R, and Julia are already underway. They're all enabled by the same C++ library (https://github.com/wesm/feather/tree/master/cpp), which I guess might play nicely with Stan. For Stata and Matlab, a C++ plugin is also feasible.
Indeed there is Stata code to write that. Is there documentation of the rdump contents? I am guessing it has the data (which is currently written out to datafile()), the initial values, and the CmdStan arguments like stepsize?
We only use the dump format for data input and initialization input.
The output's all in the CSV and comments. This is what we're refactoring now.
The dump format is documented in the CmdStan manual.
- Bob
My thinking is: StataStan has to focus energy on getting Stata people into using Stan, and I can see the merits of writing out the CmdStan settings. The Stata way to do that would be to stick them into r(). Stata has no native capacity for reading R data dumps so that's why it doesn't feel like a priority to me. I still see it as a three way interface.
The format we're talking about is CmdStan's input format. It was modeled on R, and can be read by R, but R's own dump() function writes out things that CmdStan can't read. So there's a special function in R to write output in a way that CmdStan can read it in, called stan_rdump().
You must already have a writer for this format, because it's the only way to get data into CmdStan. Is it a lot of work to expose it to the user? As you say, it would only be useful for creating files to be read in as part of a call to Stan within Stata or for creating files in Stata to be read directly into CmdStan.
- Bob
Ah... cometh the dawn. I thought there must be more to it than that. Yes, that does exist and as you say we could expose it with an option to dump only and not run Stan. At present that makes a data file and an inits file, but they could be concatenated. OK, consider it on the list. Sorry for being dense.
Glad we're on the same page now. And by all means decide how to prioritize your work here.
The R function just takes an array of variable names and dumps those. So if there'd be a way to take a list of named variables, that'd be most general.
I've been experimenting with Mata, and put together a Mata function to dump data. @robertgrant Would you be able to take a look? It can only handle scalars and vectors now, but can certainly be extended to matrices and globals quite easily. It would also make checking user's inputs in MATrices(string) and GLobals(string) more robust with the help of st_matrix and st_global and hence allow for better error messages.
Below is a minimal example:
* a mata function to write in-memory data in dump format
mata:
mata clear
function dta2dump(v, v_name, fn) {
    // turn v, a column vector, into a string with a comma between values
    d_out = subinstr(invtokens(strofreal(transposeonly(v))), " ", ", ")
    fh = fopen(fn, "a")
    // fput() adds the line terminator itself, so no extra "\n" is needed
    fput(fh, v_name + " <- c(" + d_out + ")")
    fclose(fh)
}
end
* an example
sysuse auto, clear
* get a list of (only) numeric variables...
ds, has(type numeric)
local numeric_var `r(varlist)'
putmata `numeric_var'
* and write to an empty file.
if c(os) == "Windows" shell copy NUL test_dump.data.R
else shell > test_dump.data.R
foreach var in `numeric_var' {
    mata dta2dump(`var', "`var'", "test_dump.data.R")
}
Cool. It's currently done in Stata in lines 288-412 of stan.ado (master branch, f8c2814). We need to be able to write out globals (scalars) and matrices too. I'd love to write arrays of arbitrary dimensionality but that's not such a Stata thing. Maybe Mata could help there?
You can always code up an n-dimensional array by "melting" it into an (n+1)-column data frame, with one column for each index position and an extra column for the value:
INDEXES       value
i j k l m
1 1 1 1 1     x[1,1,1,1,1]
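To make the melting idea concrete, here is a small Python sketch (an illustration only, not StataStan code). The helper name melt is hypothetical, and the array is assumed to be a nested list whose shape is passed in explicitly. Each output row holds the 1-based index positions followed by the value.

```python
from itertools import product


def melt(array, shape):
    """Yield (i, j, ..., value) rows with 1-based indexes from a nested list."""
    for idx in product(*(range(n) for n in shape)):
        value = array
        for k in idx:
            # Walk down one nesting level per index position.
            value = value[k]
        yield tuple(k + 1 for k in idx) + (value,)


# A 2 x 3 array becomes six rows of (i, j, value).
x = [[10, 20, 30], [40, 50, 60]]
rows = list(melt(x, (2, 3)))
```

The resulting rows fit in an ordinary rectangular dataset, which is the only shape Stata stores natively.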
- Bob
Thanks for the tip, Bob. I imagine we don't need multidimensional arrays to write out the globals and matrices. But if we do, we now know how.
The "dump" function can already handle column vectors; for matrices we only need to vectorize them. It should be straightforward for globals too.
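A sketch of what vectorizing a matrix could look like, written in Python for illustration rather than in Mata. It assumes the matrix form of the dump format, in which the values are listed column by column inside structure(c(...), .Dim = c(rows, cols)); the function name matrix_dump_line is hypothetical.

```python
def matrix_dump_line(name, rows):
    """Return a dump-format assignment for a matrix given as a list of rows."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same length")
    # Flatten column by column (column-major order), as R does.
    flat = [rows[i][j] for j in range(n_cols) for i in range(n_rows)]
    body = ", ".join(repr(v) for v in flat)
    return f"{name} <- structure(c({body}), .Dim = c({n_rows}, {n_cols}))"
```

For the 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] this yields X <- structure(c(1, 4, 2, 5, 3, 6), .Dim = c(2, 3)), so the only extra work over the vector case is the column-wise flattening and the .Dim suffix.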
Mata has this st_dir() function that lists all objects (globals and matrices, etc), and so potentially StataStan could just dump all globals and matrices by default (so less typing for the user with a slightly bigger dump file), and allow for the option to specify them exactly.
I'd be very happy for you to assign this issue to me, and am sure I'll need your advice on how best to incorporate it with StataStan. What I'll do is extend the function to support globals and matrices, and we can go from there.
Lastly, I think it'd be good to also have a function to read dump files. That way, StataStan could run all the examples on https://github.com/stan-dev/example-models; and even include them in the help file!
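As a rough idea of what a reader might involve, here is a Python sketch that parses only the simplest dump files. It assumes one assignment per line and handles only numeric scalars and c(...) vectors; structure(...) matrices and vectors wrapped over several lines are deliberately rejected. The names ASSIGN, parse_number and read_dump are hypothetical.

```python
import re

# One assignment per line: name <- value
ASSIGN = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_.]*)\s*<-\s*(.+?)\s*$")


def parse_number(text):
    """Parse an integer if possible, otherwise a float."""
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)


def read_dump(text):
    """Parse scalar and c(...) assignments into a dict; fail on anything else."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        m = ASSIGN.match(line)
        if m is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, rhs = m.group(1), m.group(2)
        if rhs.startswith("c(") and rhs.endswith(")"):
            inner = rhs[2:-1].strip()
            out[name] = [parse_number(t) for t in inner.split(",")] if inner else []
        else:
            # float() raises ValueError for unsupported forms such as structure(...).
            out[name] = parse_number(rhs)
    return out
```

A real reader for the example models would also need the matrix and array forms, so this only indicates the shape of the problem.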
I'll let @robertgrant manage the issue assignments.
CmdStan does read the dump files, so all that's really needed is direct plumbing from the StataStan command.
Sure.
Yes, while StataStan can already run all the examples via CmdStan (without even loading the data into Stata), it'd make learning Stan easier for Stata users if they can do basic interaction with the data in the many examples, and a read functionality would be a must-have(?) if they want to delve any deeper.
That would be great!
I just meant that the functionality's already there for StataStan to plug into --- StataStan calls CmdStan on the back end.
Oh, I misunderstood what you said earlier. I got you now~
It's sounding good. I'm heading out on holiday until Wednesday. Let's pick it up then in terms of how it appears to the Stata user.
@robertgrant Sorry for the delay ...
This is what I've got so far: (1) a write_dump.ado, which just calls the Mata function write_dump(), and (2) a test-write_dump.do, which contains some examples to test (1). By default, write_dump writes out every numeric variable in memory, plus every matrix and scalar. For now I have chosen not to write any globals, because storing a value as a scalar instead, e.g. "scalar N = r(N)", will check that nothing but a scalar is assigned to N. The fork is here: https://github.com/felixleungsc/statastan/tree/feature/issue-13-write-dump.
While write_dump is working now (though it doesn't have a help file), I have yet to make stan.ado call write_dump and not break. Will work on that soon.
Just wanted to say I haven't forgotten to take a proper look into this, I'm just swatting the wasps of medical research right now. But in general, I still think we should include extra functionality as options inside stan.ado wherever possible, so users don't have to download more stuff from SSC.
Having a separate ado file for write_dump wouldn't change the way users install the interface, as ssc install stan would download all files associated with the package, just like e.g. ssc install egenmore, which downloads dozens of ado files. So, installation-wise it'd be just as easy regardless of how the code is structured.
Usage-wise, you've probably seen the examples in test-write_dump.do, which have the syntax:
write_dump using test_dump.data.R
(which by default appends every numeric variable in memory, every matrix and scalar to test_dump.data.R). And I think you have in mind something like this:
stan *, datafile(test_dump.data.R) matrices(all) globals(all) write_only.
But essentially both would do what lines 288-412 in stan.ado are already doing: dump the data. In fact, the second syntax can be implemented and issue #13 fixed without rewriting lines 288-412 at all. Just add a write_only option to the syntax and wrap the whole Stan part of stan.ado in an if (NOT write_only) { [lines 415-724] }. That's what I would do. On top of that, I've rewritten lines 288-412 in Mata and put that in write_dump.ado, and I want to have stan.ado call it by replacing lines 288-412 with:
write_dump `varlist' `matrices' `globals' N using `datafile', replace.
That way, both syntaxes can be supported.
Sounds good. I have a few bugs noticed in Win***s that need fixing on Monday, then I'm going to adopt the convention of develop, release & hotfix branches so I will give you a shout when I'm done and yours can be the first contribution to develop.
Great! I will then re-fork stan-dev/statastan, branch off from develop, and merge the changes back in, as suggested.
In the forked repo now, stan.ado calls write_dump.ado. Before I made those changes, I first ran stan-example.do and put the results in log_stan_example.txt. I then incorporated the changes and re-ran the examples. Other than the random seeds and the numerical values, the results are identical (see git diff). I have not tested stan.ado any further.
Also, I feel it'd be better to store any scalars with e.g. scalar N = r(N). If you agree, I will change the global() option in the syntax to scalar().
It's all yours! I'm thinking of having write_dump.ado, as well as documenting the keepfiles option as another way of having S+/R-format data and inits after running. I'm happy to send scalars to Stan but I'd rather not remove the globals capability in case we pee off people who are already using it. iirc the reason i used globals and not locals was that there's a stata command that returns a list of global names, so you'd want to find something like that for scalars.
Brilliant! I see. I won't touch the global option then, and will just add one for the scalars.
Btw, I think we also need a develop branch for stan-dev/statastan.
@robertgrant stan-dev/statastan doesn't yet have a develop branch, which I would need when submitting a pull request. Could you create one please? Thanks :)