rhat and ess updates in analyze
Summary:
In the analyze service:
- [ ] Offer interfaces a unified calculation of rhat
- [x] Update effective sample size calculation
- [x] Updating ess requires a small change in the autocovariance computation
Description:
#2569 and #2575 moved the effective sample size computation from chains.hpp to the analyze service. Move the rhat calculation in the same way. In addition, make minor changes to the effective sample size and rhat calculations. These changes make the computations more robust, but in most cases affect the computed values only a little. In particular, these changes don't include rank normalization, folding, or quantiles (those features will be added later).
The reference implementation in R is at https://github.com/stan-dev/rstan/blob/develop/rstan/rstan/R/monitor.R
- [x] autocovariance: use "biased" estimate as recommended by Geyer (1992)
- [ ] rhat: for constant finite values return 1, for non-finite values return NaN
- [x] ess: 1) use split chains, 2) add half of rho at the next even lag after the truncation, 3) negative estimates and the maximum ess are both set to chains * n_samples * log10(chains * n_samples) (see the sketch below)
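For concreteness, here is a rough C++ sketch of two of the listed tweaks, following the monitor.R reference. The function names and the free-standing form are hypothetical and are not the actual analyze implementation; the safeguard caps ESS at S * log10(S) with S = chains * n_samples.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// "Biased" autocovariance estimate (Geyer 1992): divide by N at every lag,
// never by N - lag. A direct O(N^2) sketch; the real code uses the FFT.
std::vector<double> autocovariance_biased(const std::vector<double>& y) {
  const std::size_t N = y.size();
  double mean = 0.0;
  for (double v : y) mean += v / N;
  std::vector<double> acov(N, 0.0);
  for (std::size_t lag = 0; lag < N; ++lag) {
    for (std::size_t i = 0; i + lag < N; ++i)
      acov[lag] += (y[i] - mean) * (y[i + lag] - mean);
    acov[lag] /= N;
  }
  return acov;
}

// Safeguard from item 3): bounding tau_hat below by 1 / log10(S) guards
// against negative estimates and caps the ESS at S * log10(S).
double safeguarded_ess(double tau_hat, std::size_t chains,
                       std::size_t n_samples) {
  const double S = static_cast<double>(chains * n_samples);
  tau_hat = std::max(tau_hat, 1.0 / std::log10(S));
  return S / tau_hat;
}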
Reproducible Steps:
Reference values from the R implementation:
upath <- system.file('unitTests', package='rstan')
f1 <- file.path(upath, 'testdata', 'blocker1.csv')
f2 <- file.path(upath, 'testdata', 'blocker2.csv')
fit <- read_stan_csv(c(f1,f2))
monitor(fit)
sims <- as.array(fit)
rh <- rhat(sims[,,1])
checkEquals(rh, 1.007794, tolerance = 0.001);
eb <- ess_bulk(sims[,,1])
checkEquals(eb, 465.3917, tolerance = 0.001);
et <- ess_tail(sims[,,1])
checkEquals(et, 349.1353, tolerance = 0.001);
em <- ess_mean(sims[,,1])
checkEquals(em, 467.8447, tolerance = 0.001);
The same blocker1.csv and blocker2.csv files are in the stan repo at src/test/unit/mcmc/test_csv_files/ and have already been used in unit tests there.
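A corresponding check on the C++ side might look like the following sketch (googletest style). load_blocker_param is a hypothetical helper that returns the draws of the first parameter (sims[,,1] above) from the two blocker CSV files, one vector per chain; only the existing compute_effective_sample_size signature is assumed, updated as described above so that it matches ess_mean from monitor.R.

#include <gtest/gtest.h>
#include <stan/analyze/mcmc/compute_effective_sample_size.hpp>
#include <cstddef>
#include <vector>

// hypothetical helper: reads blocker1.csv and blocker2.csv and returns the
// draws of the first parameter, one std::vector<double> per chain
std::vector<std::vector<double>> load_blocker_param();

TEST(AnalyzeMCMC, ess_matches_r_reference) {
  std::vector<std::vector<double>> chains = load_blocker_param();

  std::vector<const double*> draws;
  std::vector<std::size_t> sizes;
  for (const auto& chain : chains) {
    draws.push_back(chain.data());
    sizes.push_back(chain.size());
  }

  double ess = stan::analyze::compute_effective_sample_size(draws, sizes);
  // reference value from ess_mean(sims[,,1]); tolerance mirrors checkEquals
  EXPECT_NEAR(467.8447, ess, 0.001 * 467.8447);
}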
EDIT: updated ess_mean reference value.
ping @roualdes
Thanks, @avehtari. I think autocovariance is in stan-dev/math. I'll start this week to work on these.
I think autocovariance is in stan-dev/math
If it's used elsewhere in math or exposed in the Stan language, then it's better to re-implement it in analyze
We had some discussion about this a long, long time ago. On one hand people argued that the math library should be just for the functions exposed to the autodiff library and on the other people argued that it should include all mathematical functionality used in Stan. That latter side won, which is why the autocorrelation and even the Welford accumulators used in the adaptation are in the math library.
That said, pushing everything into math does increase the testing and maintenance burden, whereas having things contained within analyze would help distribute the burden and localize the dependencies. I’d be okay going in that direction if there are no strong objections.
The reference values from the R implementation were a great idea, thank you. Just one weird issue with them. When I run the following R code,
library(rstan)
source('monitor.R')
upath <- system.file('unitTests', package='rstan')
f1 <- file.path(upath, 'testdata', 'blocker1.csv')
f2 <- file.path(upath, 'testdata', 'blocker2.csv')
fit <- read_stan_csv(c(f1,f2))
sims <- as.array(fit)
ess_mean(sims[,,1])
I get the result 467.8447. I also get this value in C++ based on my best efforts to update the function stan::analyze::compute_effective_sample_size(). However, this number does not match the number you provided, 467.3676. Would you please confirm or disconfirm the results of this R code? Am I reading the number you provided incorrectly? Something else? Thanks.
I get the result 467.8447.
I now get the same result. My previous result was mistakenly computed without splitting the chains, i.e., the result you would get from ess_rfun(sims[,,1]). Sorry for the confusion.
Thanks for double checking.
Let me know what you guys think of the following functional spec. I propose to overload stan::analyze::compute_effective_sample_size() with
double compute_effective_sample_size(std::vector<const double*> draws, std::vector<size_t> sizes)
// and
double compute_effective_sample_size(const Eigen::MatrixBase<Derived>& draws)
which attempts to accomplish two things. The first signature aligns with our discussions on Discourse. The second signature allows for composing stan::analyze::compute_effective_sample_size() with stan::analyze::split_chains(), in the same way that monitor.R does, namely compute_effective_sample_size(split_chains(...)). The arguments to stan::analyze::split_chains() follow the same logic from our Discourse discussion:
Eigen::MatrixXd split_chains(const std::vector<const double*>& draws, const std::vector<size_t>& sizes)
To enable such a separation, without needlessly copying data around, I added an (intended to be) internal function
double effective_sample_size_impl(const Eigen::MatrixBase<DerivedA>& acov, const Eigen::MatrixBase<DerivedB>& chain_mean)
I agree that this adds complication to what should/could be a simple PR. However, the outcome is that the interfaces can call stan::analyze::compute_effective_sample_size() and stan::analyze::split_chains() separately without being penalized by needless copying of data, and, more importantly, this will ease the future effort to adopt ess_bulk, which necessitates the separation of stan::analyze::compute_effective_sample_size() and stan::analyze::split_chains().
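To make the intended call patterns concrete, here is a usage sketch under the proposed signatures. The data is made up, and the split_chains() function and the Eigen overload are the proposal above, not something already in develop.

#include <Eigen/Dense>
#include <cstddef>
#include <vector>
#include <stan/analyze/mcmc/compute_effective_sample_size.hpp>
// hypothetical header for the proposed split_chains()
#include <stan/analyze/mcmc/split_chains.hpp>

int main() {
  // two chains of draws for one parameter, held in client memory
  // (real chains would of course have many more draws)
  std::vector<double> chain1 = {1.1, 0.9, 1.3, 0.7, 1.0, 1.2};
  std::vector<double> chain2 = {0.8, 1.4, 1.1, 0.6, 1.2, 0.9};
  std::vector<const double*> draws = {chain1.data(), chain2.data()};
  std::vector<std::size_t> sizes = {chain1.size(), chain2.size()};

  // existing path: no explicit splitting, client memory used via vector maps
  double ess = stan::analyze::compute_effective_sample_size(draws, sizes);

  // proposed composition: split first, then compute ESS on the split matrix
  Eigen::MatrixXd split = stan::analyze::split_chains(draws, sizes);
  double split_ess = stan::analyze::compute_effective_sample_size(split);

  (void)ess;
  (void)split_ess;
  return 0;
}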
What do you guys think?
Moving towards an API that facilitates chain splitting, for both ESS and Rhat calculations, is the right step forwards.
I’m wondering if internally this all can’t be handled with Eigen maps? In other words, the main API exposes functions that take in a std::vector of double arrays and the first internal step is laying out the memory into Eigen maps which can then be split/passed around without any unnecessary copying. This would have the double bonus of using the memory provided by the client.
What do you guys think?
I'm on board because I'm the one who was suggesting something like this in the first place to have a baseline implementation. We probably need to do the same thing with our probability functions to cut down on compile times.
I’m wondering if internally this all can’t be handled with Eigen maps?
Having a T* is a necessary and sufficient condition to create an Eigen::Map<Matrix<T, ...>>. One of the motivations here is the difficulty they've had with PyStan integrating Eigen data types through Cython and the desire to provide a non-Eigen interface.
This would have the double bonus of using the memory provided by the client.
There are competing issues of speed and scalability. The roadblock is that draws come in row by row (draw by draw, that is), but all the analysis is column by column (parameter by parameter). An efficient Eigen transpose involving a copy up front is probably worthwhile versus using non-memory-local Maps everywhere. So it should be faster to transpose once, but it requires twice the dynamic memory overhead, as you then have the original memory and a copy.
I’m wondering if internally this all can’t be handled with Eigen maps?
I thought about and tried this, but I convinced myself that it won't work. An Eigen map is created with a pointer to a contiguous block of memory and the dimension(s). The trouble is, as far as I know, std::vector<const double*>& draws will not necessarily keep all arrays next to each other, despite the fact that the elements of each double* are contiguous.
This also means split_chains() necessarily does a copy.
If you've got 5 - 20 minutes to educate me about Eigen maps or convince yourself that I'm on to something, please do.
Here's a link to my work-in-progress branch, which is on my fork of stan.
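For what it's worth, here is a minimal standalone Eigen example of the point in question: a Map is a zero-copy view over one contiguous block, so a single chain can be viewed without copying, but two separately allocated chains cannot be combined into a single Map without copying into a matrix.

#include <Eigen/Dense>
#include <iostream>
#include <vector>

int main() {
  // each chain's draws for a single parameter are contiguous in client memory
  std::vector<double> chain1 = {1.1, 0.9, 1.3, 0.7, 1.0, 1.2};
  std::vector<double> chain2 = {0.8, 1.4, 1.1, 0.6, 1.2, 0.9};

  // zero-copy view of chain1 as an Eigen column vector
  Eigen::Map<const Eigen::VectorXd> view1(chain1.data(), chain1.size());
  std::cout << "chain 1 mean: " << view1.mean() << "\n";

  // chain2 lives in a separate allocation, so no single Map can span both
  // chains; stacking them into one matrix forces a copy
  Eigen::MatrixXd both(chain1.size(), 2);
  both.col(0) = view1;
  both.col(1) = Eigen::Map<const Eigen::VectorXd>(chain2.data(), chain2.size());
  std::cout << "column means: " << both.colwise().mean() << "\n";
  return 0;
}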
I wasn’t thinking of merging everything into a matrix via a map but rather handling the initial chains, or any splitting thereof, as a collection of vector maps.
I guess it all comes down to how the autocorrelations are handled. Welford accumulator estimators can run on each chain segment and hence don’t need everything to be in one big contiguous block of memory. Plus the individual estimators can be computed in parallel. On the other hand FFT estimators, as are currently employed, probably need the continuity.
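As an aside, the streaming idea mentioned here is small to sketch. Something along the lines of the Welford estimator below can consume a chain segment one draw at a time; the actual accumulators live in stan-dev/math, and this standalone version is only for illustration.

#include <cstddef>

// Minimal Welford accumulator: running mean and variance in one pass,
// so a chain segment never needs to be held in memory all at once.
struct welford_estimator {
  std::size_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;

  void add(double x) {
    ++n;
    const double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }

  double sample_variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};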
Anyways, to maintain flexibility moving forwards I think we want to have something that flows like
input chain segments -> splitting into more chain segments -> analysis code that reads in an arbitrary number of chain segments
In other words any copying/aggregation into contiguous memory shouldn’t happen until the analysis code. In that way we can rewrite the underlying analysis code to avoid copying in the future if desired.
I wasn’t thinking of merging everything into a matrix via a map but rather handling the initial chains, or any splitting there of, as a collection of vector maps.
That seems fine. It's the locality of the individual chains that really matters here as that's the unit at which we use the data.
I guess it all comes down to how the autocorrelations are handled. Welford accumulator estimators can run on each chain segment and hence don’t need everything to be in one big contiguous block of memory.
Right. It's just that reading out of contiguous memory is faster.
Plus the individual estimators can be computed in parallel. On the other hand FFT estimators, as are currently employed, probably need the continuity.
They assume an Eigen structure as input---it may not need to be literally stored as a dense matrix, but it'll probably do an internal copy for efficiency if it's not. So if the FFT's already going to do that, we might as well do it ourselves to save it the trouble and let us use the copy elsewhere.
Anyways, to maintain flexibility moving forwards I think we want to have something that flows like
input chain segments -> splitting into more chain segments -> analysis code that reads in an arbitrary number of chain segments
In other words any copying/aggregation into contiguous memory shouldn’t happen until the analysis code. In that way we can rewrite the underlying analysis code to avoid copying in the future if desired.
The copying/aggregation into memory is happening in RStan, for example, on an iteration-by-iteration basis. In CmdStan, we just write to disk, but it's still local iteration by iteration.
I think reading all the data in and transposing then calculating R-hat, ESS, etc. will be more efficient. It's just that it's also more memory intensive.
I wasn’t thinking of merging everything into a matrix via a map but rather handling the initial chains, or any splitting there of, as a collection of vector maps.
The way I've coded it, if you don't call stan::analyze::split_chains(), then it uses vector maps and the memory provided by the client. On the other hand, I couldn't figure out how to reshape a map. So anytime stan::analyze::split_chains() is called, something closer to
reading all the data in and transposing then calculating R-hat, ESS, etc
is happening.
As I think I'm quite close to respecting the discussion points so far, I'm going to push on. I'll finish up rhat, some documentation, and a couple of other things before I submit a PR.
Thank you all for your time.
The way I've coded it, if you don't call stan::analyze::split_chains(), then it uses vector maps and the memory provided by the client. On the other hand, I couldn't figure out how to reshape a map.
I’m not sure why you’d have to deal with any reshaping — splitting the chains can be accomplished by creating a new collection of std::vectors that point to segments of the input std::vector memory.
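A sketch of what that could look like (a hypothetical helper, not necessarily the split_chains() under discussion): each half-chain is just a pointer into the caller's existing memory plus a length, so only the small pointer/size vectors are allocated.

#include <cstddef>
#include <utility>
#include <vector>

// Split each chain in half by pointing into the original memory; when a
// chain has an odd number of draws, the middle draw is dropped (one
// possible convention). No draw values are copied.
std::pair<std::vector<const double*>, std::vector<std::size_t>>
split_chain_views(const std::vector<const double*>& draws,
                  const std::vector<std::size_t>& sizes) {
  std::vector<const double*> half_draws;
  std::vector<std::size_t> half_sizes;
  for (std::size_t i = 0; i < draws.size(); ++i) {
    const std::size_t half = sizes[i] / 2;
    half_draws.push_back(draws[i]);                      // first half
    half_sizes.push_back(half);
    half_draws.push_back(draws[i] + (sizes[i] - half));  // second half
    half_sizes.push_back(half);
  }
  return {half_draws, half_sizes};
}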
So anytime stan::analyze::split_chains() is called, something closer to
reading all the data in and transposing then calculating R-hat, ESS, etc
is happening
This worries me a bit — the implementation of analyze should be independent of any possible intermediate splitting. If split chains might not be contiguous, and so require copying into contiguous memory, then the initial chains might also not be contiguous and require the same copy.
In other words, analyze shouldn’t know whether the chains were split or not. It just consumes an array of chain segments and runs the diagnostics ignorant of where those segments came from.
In any case this can be addressed in the PR if you prefer.
splitting the chains can be accomplished by creating a new collection of std::vectors that point to segments of the input std::vector memory
That's a reasonable way to split chains in order to use the memory provided by the client. But I still don't see how to use the memory provided by the client and separate split_chains() from compute_effective_sample_size(). Even if split_chains() created a new collection of std::vectors that point to segments of the input std::vector memory, you'd still need to return a second std::vector that records the sizes of the new (halved) chains -- or call split_chains() within the scope of compute_effective_sample_size(), which I'm trying to avoid.
I guess one could have split_chains() return an Eigen map of a new collection of std::vectors that point to segments of the input std::vector memory. But it's my understanding that a copy would be incurred as soon as an Eigen Map is returned from a function. If somebody knows something about the interplay between functions that return an Eigen map and return value optimization, please share.
Without forcing split_chains() to have multiple return values (std::pair or something), compute_effective_sample_size() still needs both the number of chains and the length of each chain; its arguments are const std::vector<const double*>& draws, const std::vector<size_t>& sizes.
I keep trying to do the calculations on Eigen data structures since an Eigen::MatrixXd carries along both the number of rows (iterations) and columns (number of chains). But reshaping an Eigen Map isn't obvious to me.
This worries me a bit — the implementation of analyze should be independent of any possible intermediate splitting.
That's fair, and I hope it was only due to my lack of clarity. Neither of the two functions I'm proposing to add, Eigen::MatrixXd split_chains(const std::vector<const double*>& draws, const std::vector<size_t>& sizes) nor double compute_effective_sample_size(const Eigen::MatrixBase<Derived>& draws), requires knowledge of whether or not its arguments were/are split.
Under my proposal, if an interface wants to first split chains and then calculate ESS all in C++, they would call compute_effective_sample_size(split_chains(...)). As it stands, split_chains() copies into an Eigen::MatrixXd with double the number of chains and half the number of iterations. As far as I know, Eigen matrices are in column-major order. Thus, if the user calls split_chains() first, then all calculations would be performed on contiguous memory with respect to each chain. It's my understanding that this is essentially the transpose that Bob was speaking about.
On the other hand, if the user does not call split_chains() first, then the candidate function is (the already available) double compute_effective_sample_size(std::vector<const double*> draws, std::vector<size_t> sizes). This function, too, does not require knowledge of ~~the number of chains nor the number of iterations~~ whether or not the chains were/are split, but does directly use the memory provided by the client via Eigen vector maps.
But this requires two implementations of the analyze function, right? That’s what I want to move away from — it increases the maintenance and testing burden significantly and will add inertia to future tweaks and improvements.
std::vector<double*> is cheap, so is std::vector
The “transpose” Bob is referring to comes about when one wants to use the FFT to compute the autocovariances of each parameter. If we move away from that functionality then we can do everything without needing memory locality between the chains and work with the contiguity guaranteed by the std::vectors alone.
But this requires two implementations of the analyze function, right? That’s what I want to move away from — it increases the maintenance and testing burden significantly and will add inertia to future tweaks and improvements.
std::vector<double*> is cheap, so is std::vector.
I wouldn't call either cheap in the sense that both manage their own heap-based memory. But if this is a std::vector<double*> where the vector is a vector of draws, each of which is a double*, then that's going to be super slow for any parameter-wise operations. Now we could go through and do something like streaming updates Welford-style, but when we have a lot of parameters, we'd have to be careful to stream in and out the accumulators efficiently.
Consequently there’s no issue with split_chains creating new versions of each to pass onto the analyze functions. Unless I’m missing something?
You mean you'd have a split view of some kind? Unlike strings, there's no efficient way to create a view. But we could do that some other way without literally splitting.
Having to have copies of everything in memory all at once is super expensive when viewed as max dynamic memory required. That's why the transpose solution is expensive.
The “transpose” Bob is referring to comes about when one wants to use the FFT to compute the autocovariances of each parameter.
It's literally a transpose if the draws are organized as a (draw x parameter) matrix.
If we move away from that functionality then we can do everything without needing memory locality between the chains and work with the contiguity guaranteed by the std::vectors alone.
What's contiguous, draws, or parameters?
I wouldn't call either cheap in the sense that both manage their own heap-based memory. But if this is a std::vector<double*> where the vector is a vector of draws, each of which is a double*, then that's going to be super slow for any parameter-wise operations. Now we could go through and do something like streaming updates Welford-style, but when we have a lot of parameters, we'd have to be careful to stream in and out the accumulators efficiently.
I think we’re talking past each other here.
The interfaces are supposed to supply their chains as a std::vector<double*> of addresses pointing to contiguous memory where the chain states begin, with a std::vector defining the chain lengths.
Consequently the std vectors themselves don’t take up much space and are cheap to brute force copy, no?
You mean you'd have a split view of some kind? Unlike strings, there's no efficient way to create a view. But we could do that some other way without literally splitting.
I’m thinking of this as the interfaces supplying vectors of addresses to contiguous blocks of memory that hold the chain values. Consequently it’s cheap to create new vectors with additional addresses corresponding to where the chains are split.
Having to have copies of everything in memory all at once is super expensive when viewed as max dynamic memory required. That's why the transpose solution is expensive.
Right, this is another issue. If we can break down the analysis code to handle one parameter at a time then we can drastically reduce the memory overhead. There are streaming algorithms for everything needed (I have code for autocorrelations, etc) and that would really boost the scalability of the functionality.
What's contiguous, draws, or parameters?
Both within a chain?
But this requires two implementations of the analyze function, right? That’s what I want to move away from.
Yes. And I agree, this design choice suffers from all the faults you mention. I'm just struggling to see an alternative that doesn't involve overloading compute_effective_sample_size() in even more complex ways.
@betanalpha, would you please be a little more explicit about how you propose to keep split_chains() and compute_effective_sample_size() separate? The ESS calculation will need the lengths of the halved chains, but I don't want to force compute_effective_sample_size() to always split within this function itself. compute_effective_sample_size() shouldn't do the splitting since the functions within monitor.R show us that ess_mean() and ess_bulk() require splitting and ESS calculations to happen independently from each other.
The interfaces are supposed to supply their chains as a std::vector<double*> of addresses pointing to contiguous memory where the chain states begin with a std::vector defining the chain lengths, right?
I still don't know the memory layout, so let's walk through an example. Let's say I have two chains of three draws and three parameters each:
chain, iteration, theta1, theta2, theta3
1, 1, v1, v2, v3
1, 2, v4, v5, v6
1, 3, v7, v8, v9
2, 1, v10, v11, v12
2, 2, v13, v14, v15
2, 3, v16, v17, v18
How do the 18 values get laid out in memory?
If we're reading out of the sampler produced CSV, these are stored row-major as above, though the chains would be in separate files.
For analysis of things like R-hat and ESS, we need access to the values in column-major order unless you're anticipating some kind of Welford-like sweep. That might be more efficient than transposing, but I doubt it when there are more than a hundred or two hundred parameters.
They’re never passing the states themselves, just addresses to the relevant chunks of memory. Consequently the std vectors themselves don’t take up much space and are cheap to brute force copy, no?
Just copying the pointer (reference) to a std::vector is O(1). Making a copy is O(N) in both time and space where N is the size of the vector. This requires malloc on the heap and calls to both constructor and destructor before all is said and done. But the copy itself is relatively cheap as it's all just contiguous memory.
I’m thinking of this as the interfaces supplying vectors of addresses to to contiguous blocks of memory that hold the chain values. Consequently it’s cheap to create new vectors with additional addresses corresponding to where the chains are split.
OK, that makes sense.
Having to have copies of everything in memory all at once is super expensive when viewed as max dynamic memory required. That's why the transpose solution is expensive.
There are streaming algorithms for everything needed (I have code for autocorrelations, etc) and that would really boost the scalability of the functionality.
OK, that makes sense no matter what we do because the streaming algorithms are also more numerically stable.
What's contiguous, draws, or parameters?
Both within a chain?
See above. We can't have both because memory is 1D. We get either
row major: v1, v2, v3, ..., v18
or
column major: v1, v4, v7, ... v18
Ah, okay, I see the main point of confusion. I was focusing on the chains and you have been talking about some of the issues within the chains.
Right now the agreement was that the interfaces would provide vectors of pointers to memory where the states of each chain are held. The chain memory is required to be contiguous.
I’m not sure that we made any agreement about how the chain states themselves should be organized, but we definitely should in order to move forwards!
There are two options for within chain organization. We could group by iteration,
param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...
or by parameter,
param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...
The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted. The former drastically facilitates chain splitting and streaming summary calculations.
If we group by iteration then the analysis code would have to “transpose” at least once, requiring a copy and putting the entire chain contents into memory. Splitting functions could still be implemented in a natural way, selecting the appropriate values to reconstruct the subchain states grouped by iteration in a sequence of copies before each subchain is “transposed” within the analysis code.
If we group by parameter then the analysis code could run on the input memory directly requiring no copies. Splitting functions would then be straightforward to implement for a single parameter, but more troublesome if the parameters had to be split and then reconstructed into a single block of contiguous chain memory. This approach would also put more burden on the interfaces to corral their output into the marginal parameter values.
This raises one more possibility — the analysis functions, including the splitting, operate on a single parameter at a time. The signatures would look similar except that each chain pointer points to a contiguous block of memory holding just
param1_iter1, param1_iter2, …
Now splitting is straightforward and doesn’t require any copies and we need only one parameter in memory at a time allowing the functionality to scale to larger problems (and potentially even setting things up for streaming inputs to scale up arbitrarily big).
This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.
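A caller-side sketch of this third option: every name below other than compute_effective_sample_size is hypothetical. The interface walks parameter by parameter, handing the analysis code one pointer per chain, so only one parameter's draws are ever in memory at a time.

#include <cstddef>
#include <vector>
#include <stan/analyze/mcmc/compute_effective_sample_size.hpp>

// hypothetical interface-side reader: returns one std::vector<double> of
// draws per chain for parameter p (e.g. awk-style column extraction)
std::vector<std::vector<double>> read_parameter_draws(std::size_t p);

void run_diagnostics(std::size_t num_params) {
  for (std::size_t p = 0; p < num_params; ++p) {
    std::vector<std::vector<double>> chains = read_parameter_draws(p);

    std::vector<const double*> draws;
    std::vector<std::size_t> sizes;
    for (const auto& chain : chains) {
      draws.push_back(chain.data());  // contiguous: param p, iter 1, 2, ...
      sizes.push_back(chain.size());
    }

    double ess = stan::analyze::compute_effective_sample_size(draws, sizes);
    (void)ess;  // report ess, r-hat, etc., then this parameter's draws can
                // be released before the next one is read
  }
}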
Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to
a) Move towards analyzing each parameter/random variable/pushforward distribution one by one, which is where things are moving.
b) Doesn’t obstruct allowing the interfaces to scale to large outputs, at least as much as possible.
Any thoughts?
Any thoughts? I don't want to obstruct current work, but I do think that if we can figure out the right conventions now then we will significantly facilitate future work. Especially before any of the interfaces start using these functions and freeze in a convention by default.
Your third idea seems best to me, but possibly because I can't differentiate it from what already exists. This is likely just my misunderstanding. Would you mind giving a short statement about how your third idea is different from what is in stan-dev/stan develop?
I need to review in more detail how each interface stores its chains. Just assuming each interface's maintainer will work it out does not make for a great holistic view. I should have better answers to Bob's questions about memory layout. After this semester ends, my time will free back up.
...
I missed this earlier.
There are two options for within chain organization. We could group by iteration,
param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...
or by parameter,
param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...
The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted.
I think that's backwards. If we have parameters mu and sigma, we get: mu[1], sigma[1], mu[2], sigma[2], ... out of CmdStan if you look at the CSV output. That is, it's local by draw, not by parameter.
There's also the issue of how to order by chain when there are multiple chains, but that's less of an issue.
Mitzi's building CmdStanPy to use pandas to do an implicit transpose and store a column-major data frame so the draws for a parameter in a chain are grouped together in memory.
The former drastically facilitates chain splitting and streaming summary calculations.
Isn't that the latter? I just think you get the order reversed here.
...
This raises one more possibility — the analysis functions, including the splitting, operate on a single parameter at a time.
I think that'd be cleanest to implement. Everything's done marginally anyway. If we get some multivariate summaries, we could add those later.
The signatures would look similar except that each chain pointer points to a contiguous block of memory holding just
param1_iter1, param1_iter2, …
Now splitting is straightforward and doesn’t require any copies and we need only one parameter in memory at a time allowing the functionality to scale to larger problems (and potentially even setting things up for streaming inputs to scale up arbitrarily big).
We also need to know how to index the chains.
This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.
Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to
a) Move towards analyzing each parameter/random variable/ pushforward distribution one by one which is where things are moving.
b) Doesn’t obstruct allowing the interfaces to scale to large outputs, as least as much as possible.
For (b), we'd need to not read the whole set of CSV files into memory, but read parameter by parameter if we really want to scale. If we do read them in, they have to be read into the right order of memory so they don't later need to be transposed, as that would require even more memory overhead.
Your third idea seems best to me, but possibly because I can't differentiate it from what already exists. This is likely just my misunderstanding. Would you mind giving a short statement about how your third idea is different from what is in stan-dev/stan develop?
It depends on the exact structure of the chains currently being used.
I need to review in more detail how each interface stores its chains. Just assuming each interface's maintainer will work it out does not make for a great holistic view.
We don’t want to make using these functions onerous for the interface clients but at the same time this is our opportunity to establish a useful convention moving forwards to anchor the interfaces.
I should have better answers to Bob's questions about memory layout.
Establishing a good convention will definitely be the first step forwards.
After this semester ends, my time will free back up.
This definitely sounds like a summer project.
There are two options for within chain organization. We could group by iteration,
param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...
or by parameter,
param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...
The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted.
I think that's backwards. If we have parameters mu and sigma, we get: mu[1], sigma[1], mu[2], sigma[2], ... out of CmdStan if you look at the CSV output. That is, it's local by draw, not by parameter.
Sorry, you’re correct.
Mitzi's building CmdStanPy to use pandas to do an implicit transpose and store a column-major data frame so the draws for a parameter in a chain are grouped together in memory.
This requires putting everything into memory to do the transpose. What I’m hoping to set up is a convention that avoids having to transpose anything if the interface doesn’t want it to do.
The former drastically facilitates chain splitting and streaming summary calculations.
Isn't that the latter? I just think you get the order reversed here.
Yup.
The signatures would look similar except that each chain pointer points to a contiguous block of memory holding just
param1_iter1, param1_iter2, …
Now splitting is straightforward and doesn’t require any copies and we need only one parameter in memory at a time allowing the functionality to scale to larger problems (and potentially even setting things up for streaming inputs to scale up arbitrarily big).
We also need to know how to index the chains.
Does anything depend on the order of the chains? The calculations are all exchangeable, right?
This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.
Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to
a) Move towards analyzing each parameter/random variable/ pushforward distribution one by one which is where things are moving.
b) Doesn’t obstruct allowing the interfaces to scale to large outputs, as least as much as possible.
For (b), we'd need to not read the whole set of CSV files into memory, but read parameter by parameter if we really want to scale. If we do read them in, they have to be read into the right order of memory so they don't later need to be transposed, as that would require even more memory overhead.
Right — again I’m thinking of CSVs on the command line. awk will scan through and dump only a single column into memory very efficiently without having to put everything into memory at any intermediate step. Not sure if that’s interpretable as an implicit transpose or not, but it demonstrates that it shouldn’t be hard for interfaces to prepare marginal chains efficiently.
The diagnostics could even then be parallelized...
There are two options for within chain organization. We could group by iteration,
param1_iter1, param2_iter1, param1_iter2, param2_iter2, ...
or by parameter,
param1_iter1, param1_iter2, …, param2_iter1, param2_iter2, ...
The latter is compatible with streaming output from a Markov chain and, in particular, how CmdStan output is naturally formatted.
I think that's backwards. If we have parameters mu and sigma, we get: mu[1], sigma[1], mu[2], sigma[2], ... out of CmdStan if you look at the CSV output. That is, it's local by draw, not by parameter.
Sorry, you’re correct.
Mitzi's building CmdStanPy to use pandas to do an implicit transpose and store a column-major data frame so the draws for a parameter in a chain are grouped together in memory.
This requires putting everything into memory to do the transpose.
Only the memory for the transposed form ever gets loaded into memory. It doesn't do a load then a transpose, which would be terrible. But it does require the full output to be loadable.
What I’m hoping to set up is a convention that avoids having to transpose anything if the interface doesn’t want it to do.
The transposition has to happen either implicitly or explicitly because the output comes per iteration and the analysis happens per parameter. What we could do, though, is just read one parameter at a time out of the CSV files. That'd be most scalable, but it'd be a lot of file system and I/O pressure to keep scanning CSV files. At some point, you'd want to move to a database with appropriate indexing.
...
We also need to know how to index the chains.
Does anything depend on the order of the chains? The calculations are all exchangeable, right?
Nope. Other than perhaps linking them up to their adaptation parameters, etc.
This third approach also requires a bit of work from the interfaces, but I think that it’s manageable. It’s certainly straightforward with some awk magic (or equivalent) from CmdStan csv output and not too problematic from RStan and PyStan which already have everything in memory.
Sorry for the initial confusion! Any thoughts about how to move forwards? I personally am intrigued by the third option because it allows us to
a) Move towards analyzing each parameter/random variable/ pushforward distribution one by one which is where things are moving.
b) Doesn’t obstruct allowing the interfaces to scale to large outputs, as least as much as possible.
For (b), we'd need to not read the whole set of CSV files into memory, but read parameter by parameter if we really want to scale. If we do read them in, they have to be read into the right order of memory so they don't later need to be transposed, as that would require even more memory overhead.
Right — again I’m thinking of CSVs on the command line. awk will scan through and dump only a single column into memory very efficiently without having to put everything into memory at any intermediate step.
OK, we're on the same page then.
Not sure if that’s interpretable as an implicit transpose or not,
It's even worse in that all the data gets read from the file system for each column. At least I don't think awk is clever enough to memory map. But even if it memory mapped, it'd be a paging disaster with files with lots of parameters.
So I'm not sure we're going to find file sizes where this will work. They'd have to be big enough to prohibit reading into memory, at which point, scanning the files multiple times may be prohibitive. Have you tried it?
The transposition has to happen either implicitly or explicitly because the output comes per iteration and the analysis happens per parameter. What we could do, though, is just read one parameter at a time out of the CSV files. That'd be most scalable, but it'd be a lot of file system and I/O pressure to keep scanning CSV files. At some point, you'd want to move to a database with appropriate indexing.
Sure. A lot will depend on how efficient the interface readers are, but given how incredibly fast awk is it should be possible to scan through individual parameters efficiently.
Not sure if that’s interpretable as an implicit transpose or not,
It's even worse in that all the data gets read from the file system for each column. At least I don't think awk is clever enough to memory map. But even if it memory mapped, it'd be a paging disaster with files with lots of parameters.
So I'm not sure we're going to find file sizes where this will work. They'd have to be big enough to prohibit reading into memory, at which point, scanning the files multiple times may be prohibitive. Have you tried it?
I’ve definitely run into problems where parsing parameters one at a time through awk is much faster than trying to manage the entire memory burden. This is common with models that contain lots of generated quantities useful for prediction but not for diagnostics. In other words the parameter-by-parameter analysis is nicely compatible with filtered diagnostics that consider only a subset of parameters.
I'm still a few posts back, so bear with me as I try to catch up. RStan and PyStan both store draws x chains x parameters. R is always column-major. PyStan is column-major also, by forcing Fortran ordering via numpy. Based on this thread it sounds like CmdStanPy will also be column-major via Pandas. So the answer to
How do the 18 values get laid out in memory?
is
column major: v1, v4, v7, ... v18
for RStan, PyStan, and CmdStanPy.
Lately, we are more interested in CSVs on the command line (presumably with CmdStan?). The dangling question is: do we need to read all draws into memory and transpose, or is Awk fast enough to pull out specific columns of each CSV so as to store multiple chains of a single parameter? Here are some ballpark numbers from a quick experiment with essentially the following Stan program, run for 5,000,000 samples on 1 chain:
generated quantities {
  real z[10];
  for (i in 1:10)
    z[i] = normal_rng(0, 1);
}
Reading in the Stan output CSV file using numpy.loadtxt() and forcing column-major ordering (a transpose) took approximately 50 seconds. Reading in the third column of the same CSV file with Awk via
time awk -F "\"*,\"*" 'FNR>31 {print $3}' samples1.csv > /dev/null 2>&1
took roughly 4.5 seconds, with an additional ~half second for each subsequent column. To read in the 12th column, the above Awk command took about 9.5 seconds. In total that's about 70 seconds for columns 3 through 12; roughly 20 seconds slower than reading the entire CSV + transpose. This is by no means intended to be a conclusive experiment, especially since I just learned enough Awk to do this. Instead, I'm just putting some numbers on the board.
Now, the above question, about Awk versus reading everything into memory and transposing, implicitly assumes that we've agreed on Michael's third option where
each chain pointer points to a contiguous block of memory holding just param1_iter1, param1_iter2, …
It doesn't seem like anybody disagrees about this, but I think we should at least make it explicit that we've agreed upon this much, if we in fact agree upon this. I'm all for this third option.
CmdStan stores a single chain per file, with memory locality by iteration. That is, with M draws and K parameters, we have the order
theta[1, 1], theta[1, 2], ..., theta[1, K], theta[2, 1], ..., theta[M, 1], ..., theta[M, K]
So that gives us the ordering
CmdStan CSV file layout: (chain, iteration, parameter)
When you run RStan, the fit object you get back is a huge structure. The draws are stored within a list of chains, where each chain is a list of parameters, and each parameter is an array of values indexed by iteration. So that gives us
RStan fit object layout: (chain, parameter, iteration)
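A small sketch of the two index conventions just described, treating each chain as one flat block of M draws by K parameters for illustration (zero-based indices):

#include <cstddef>

// CmdStan CSV order within a chain: grouped by iteration,
// theta[1,1], theta[1,2], ..., theta[1,K], theta[2,1], ...
double by_iteration(const double* chain, std::size_t K,
                    std::size_t m, std::size_t k) {
  return chain[m * K + k];
}

// RStan fit-object order within a chain: grouped by parameter,
// theta[1,1], theta[2,1], ..., theta[M,1], theta[1,2], ...
double by_parameter(const double* chain, std::size_t M,
                    std::size_t m, std::size_t k) {
  return chain[k * M + m];
}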
I don't even know that much awk, but it's going to scale worse as the number of parameters grows since it'll have to page over more of the file to get a single column.