Whitelist of example models

Open • rtrangucci opened this issue on Mar 2, 2016 • 13 comments

As discussed in the Stan meeting on Tuesday, we're going to create a whitelist of 20 models that will serve as example models we'd actually like rstan users to learn from. The whitelisted models will satisfy the following criteria (a sketch of how the sampling checks might be automated follows the list):

  • Stan code that matches our current best-practices standards for efficiency, clarity, and priors
  • Has a dataset adhering to the naming convention referenced by @bgoodri in #45
  • Runs quickly (less than 2 minutes?)
  • All Rhats below 1.1
  • No divergent transitions (n_divergent = 0)
  • Represent models we'd like Stan to be known for (H(G)LM, IRT, etc.)
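
As a rough illustration, the sampling criteria above could be checked automatically along these lines in rstan; the model file, data, and two-minute threshold are placeholders, not a settled spec:

    library(rstan)

    # Fit one candidate example (program and data here are placeholders).
    fit <- stan(file = "example.stan", data = example_data,
                chains = 4, iter = 2000, seed = 123)

    # All Rhats below 1.1.
    rhats <- summary(fit)$summary[, "Rhat"]
    stopifnot(all(rhats < 1.1, na.rm = TRUE))

    # No divergent draws after warmup (the column may be named
    # "n_divergent__" in older Stan versions).
    sp <- get_sampler_params(fit, inc_warmup = FALSE)
    n_div <- sum(sapply(sp, function(x) sum(x[, "divergent__"])))
    stopifnot(n_div == 0)

    # Runs quickly, if a time limit is kept: slowest chain under 2 minutes.
    stopifnot(max(rowSums(get_elapsed_time(fit))) < 120)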

rtrangucci avatar Mar 02 '16 22:03 rtrangucci

@rayleigh For some reason I can't assign multiple people on GitHub, but I wanted to assign you to this too.

rtrangucci avatar Mar 02 '16 22:03 rtrangucci

I am not sure running in less than 2 minutes (on what CPU?) is an absolute requirement for a list of models that we want people to learn from. There should, however, be some time limit on a list of models that are intended to be used for automated testing of new (or old) algorithms.

bgoodri avatar Mar 02 '16 22:03 bgoodri

OK, we can strike that requirement from the list. Agreed that it's probably too restrictive a time limit for the whitelist's purpose.

Rob

rtrangucci avatar Mar 02 '16 22:03 rtrangucci

I agree. Indeed, for the purpose of testing ADVI, we want examples that take a while in NUTS; that's part of the point. I'd say that, to be in the whitelist, an example has to have been fit using NUTS and run to convergence; then we want to save a bunch of information, including the posterior mean and sd for all parameters, R-hat and effective sample size for all parameters, and the runtime. Then if people want to do a test, they can just run the whitelist models that take less than 2 minutes or whatever. But really we can have examples in our whitelist that take hours to run; the point is that such an example only has to be run once.

Also, let’s be clear: it’s not a whitelist of models (a “model” is one block in a Stan program), it’s not even a whitelist of Stan programs. It’s a whitelist of examples, where an “example” includes a Stan program plus data.

A
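
A hedged sketch of saving that reference information with rstan, assuming a converged NUTS fit called fit; the list structure and file name are placeholders:

    library(rstan)

    s <- summary(fit)$summary  # one row per parameter

    reference <- list(
      mean    = s[, "mean"],           # posterior means
      sd      = s[, "sd"],             # posterior sds
      rhat    = s[, "Rhat"],           # R-hat for all parameters
      n_eff   = s[, "n_eff"],          # effective sample sizes
      runtime = get_elapsed_time(fit)  # per-chain warmup/sampling seconds
    )

    saveRDS(reference, "example.reference.rds")  # stored once, next to the data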

andrewgelman avatar Mar 02 '16 22:03 andrewgelman

So are we now talking about two different whitelists? I agree that we should have some models that take a while in NUTS for the purpose of the ADVI testing, but I don't think we should have models that take a very long time to run for the purpose of demos for users. (It's fine if some models in the full set of example models take a long time to run, but if a user just wants to run stan_demo so they can play with the output, then a model that runs for hours would be very frustrating.) These are two very different purposes, both important, so I want to make sure we don't conflate them.

jgabry avatar Mar 02 '16 23:03 jgabry

I'm thinking of this in the context of advi-test. But the general principle remains: if an example is in the whitelist, it should already have been successfully fit using NUTS, which means we'll know (roughly) how long it will take to fit, which means that information can be saved as metadata in the whitelist (along with parameter estimates, standard errors, etc.), which means that if a user wants to run a test, he or she would have a sense of how long it would take.

andrewgelman avatar Mar 03 '16 00:03 andrewgelman

I agree with Jonah on this. Is this issue to come up with a whitelist for stan_demo in general or for test_ADVI and test_NUTS? If it's for a whitelist for stan_demo, I can create a separate issue for test_ADVI and test_NUTS.

rayleigh avatar Mar 03 '16 17:03 rayleigh

Run times don't make sense to store --- they're system specific, and also depend on the system load at the time the tests are run.

I don't think we all need to get hung up on "model" being the name of a block in a Stan program. In the language I'm using going forward there are three distinct things:

  1. the mathematical model --- that's a log posterior log p(theta | y) up to a proportion
  2. an algorithm to implement the model density
  3. a Stan program that implements the algorithm

We almost never talk about (2) as such, but we put a lot of discussion in at this level --- using log_sum_exp, using bernoulli_logit, using dynamic programming for a state-space model, etc. Then there's variation in how you actually program the implementation, e.g., does it use functions or define things inline.

  • Bob
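
To make the three levels concrete, here is a hedged sketch: two Stan programs (written as R strings for rstan) implementing the same mathematical model, a simple logistic regression, with different implementation choices at the level Bob describes; neither is an official example:

    # Level 1: the same log posterior, log p(alpha, beta | x, y), for both.

    # Direct transcription: bernoulli with an explicit inverse logit.
    naive <- "
      data { int<lower=0> N; vector[N] x; int<lower=0,upper=1> y[N]; }
      parameters { real alpha; real beta; }
      model { y ~ bernoulli(inv_logit(alpha + beta * x)); }
    "

    # Same model, different implementation: bernoulli_logit works on the
    # log-odds scale, which is more stable and more efficient.
    stable <- "
      data { int<lower=0> N; vector[N] x; int<lower=0,upper=1> y[N]; }
      parameters { real alpha; real beta; }
      model { y ~ bernoulli_logit(alpha + beta * x); }
    "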

bob-carpenter avatar Mar 03 '16 17:03 bob-carpenter

If by "roughly" you mean to within a factor of 2 or 3, barring really bad hardware or slow chains, then yes. Remember, there's a ton of variation across our runs due to random inits, as we have to remind ourselves every time we do one of these evals.

  • Bob

bob-carpenter avatar Mar 03 '16 17:03 bob-carpenter

And a huge variation in the hardware people use. Just host a course and be amazed at what hardware still runs after so many years!

betanalpha avatar Mar 03 '16 17:03 betanalpha

I think we've gotten our lines of communication crossed here, and it's my fault for not being more explicit in the issue.

The model whitelist, or Stan program whitelist, is aimed at defining a smallish set of Stan programs that we'd like rstan users to run via stan_demo: programs that reflect our current best practices for code and priors and that, of course, converge with no pathologies in the sampling. In my mind, these programs are meant to be the live end products of the step-by-step iterative code optimization and prior recommendations laid out in our manual (e.g., see section 6.12 on Multivariate Priors for Hierarchical Models). These examples could even be expanded upon in the knitr folder so we have nice cohesion between stan_demo and knitr.

My sense is that we all agree it is not good that users can currently run, via the stan_demo function, Stan programs with poorly optimized code or code that doesn't even parse. It would be useful to limit users to a subset of the example-models that we've made sure pass our code/statistics/inference standards.

The Stan program whitelist is connected to @rayleigh's ADVI project only insofar as his project will call stan_demo to get a list of Stan programs and the data associated with those programs.
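
A minimal sketch of how that restriction might look, assuming the whitelist ships as a character vector of example names; the names and the wrapper function are hypothetical, not a proposed rstan API:

    library(rstan)

    # Hypothetical vetted examples (each a Stan program plus its data).
    whitelist <- c("eight_schools", "radon_vary_intercept", "irt_2pl")

    demo_whitelisted <- function(model) {
      if (!model %in% whitelist)
        stop("'", model, "' has not been vetted; pick one of the whitelist")
      stan_demo(model)  # stan_demo() fetches the program and data and fits it
    }

    fit <- demo_whitelisted("eight_schools")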

rtrangucci avatar Mar 03 '16 17:03 rtrangucci

Yes, within half an order of magnitude is fine.

andrewgelman avatar Mar 03 '16 21:03 andrewgelman

I don't see the utility, but this is your research project, not mine.

  • Bob

bob-carpenter avatar Mar 03 '16 22:03 bob-carpenter