design-docs
design-docs copied to clipboard
Added regression syntax design doc
I've been thinking about some sort of regression syntax in Stan. It's kind of a follow up to Speeding up hierarchical models. When I say regression syntax I'm talking about the interfaces (brms/rstanarm/lme4/glmer/lm/...) that take in something like the R regression syntax (y ~ 1 + x + (1 | group) + ...) and do something with it.
It's probably something that has come up before, but there's just so much there it's hard to talk about it without writing a lot down. So I tried to write down the bare minimum justification and implementation of a regression syntax in Stan.
Stan is written to do regressions and more, so at face value this doesn't seem that important but I'm trying to hit three points:
- Make it possible to avoid design matrix libraries outside of Stan
- Package up big computation to send it off to a GPU/accelerator
- Build design matrices if you need them
Now that I got that written out I have some questions hoooray:
-
@jgabry @paul-buerkner @bgoodri does avoiding external design matrices buy us anything from an interface-writing perspective? I thought it would when I wrote this, but I've never actually written a high level regression interface and so I have my doubts.
-
@SteveBronder @t4c1 @rok-cesnovar is this syntax enough to pack things up into big, GPU-friendly calculations?
-
@dpsimpson is it actually useful to build out design matrices in the way I've written here? I remember the sparse stuff you talked about we do linear algebra on them, but maybe this syntax isn't enough to build the matrices you need and you're left having to pass them in from the outside anyway.
Rendered markdown is here
Tagging @enetsee who is also interested in this topic!
Short version. If we stick to just lme4 stuff initially, there are two cases;
- Don't marginalize out anything, in which case probably dense is enough
- Marginalize out stuff, in which case we need to be able to do sparse matrix stuff (like in the lme4 paper for the Gaussian case, or similarly with Laplace approximations).
This is an extremely cool idea!! and it would be very nice to our Python users (but also to our R users, who currently translate from brms and then modify, which is messy).
And obviously we can add cool things like GPs (dense) and spatial models (often sparse), and smoothing splines (often sparse), and time series models (sparse). It would be ideal if we could keep as much of the brms notation as possible for these complex models...
On Sat, 26 Sep 2020 at 19:08, Ben Bales [email protected] wrote:
I've been thinking about some sort of regression syntax in Stan. It's kind of a follow up to Speeding up hierarchical models https://discourse.mc-stan.org/t/speeding-up-hierarchical-models/14783. When I say regression syntax I'm talking about the interfaces (brms/rstanarm/lme4/glmer/lm/...) that take in something like the R regression syntax (y ~ 1 + x + (1 | group) + ...) and do something with it.
It's probably something that has come up before, but there's just so much there it's hard to talk about it without writing a lot down. So I tried to write down the bare minimum justification and implementation of a regression syntax in Stan.
Stan is written to do regressions and more, so at face value this doesn't seem that important but I'm trying to hit three points:
- Make it possible to avoid design matrix libraries outside of Stan
- Package up big computation to send it off to a GPU/accelerator
- Build design matrices if you need them
Now that I got that written out I have some questions hoooray:
@jgabry https://github.com/jgabry @paul-buerkner https://github.com/paul-buerkner @bgoodri https://github.com/bgoodri does avoiding external design matrices buy us anything from an interface-writing perspective? I thought it would when I wrote this, but I've never actually written a high level regression interface and so I have my doubts. 2.
@SteveBronder https://github.com/SteveBronder @t4c1 https://github.com/t4c1 @rok-cesnovar https://github.com/rok-cesnovar is this syntax enough to pack things up into big, GPU-friendly calculations? 3.
@dpsimpson https://github.com/dpsimpson is it actually useful to build out design matrices in the way I've written here? I remember the sparse stuff you talked about we do linear algebra on them, but maybe this syntax isn't enough to build the matrices you need and you're left having to pass them in from the outside anyway.
Rendered markdown is here https://github.com/bbbales2/design-docs/blob/feature/0026-regression-syntax/designs/0026-stan-regression-syntax.md
You can view, comment on, or merge this pull request online at:
https://github.com/stan-dev/design-docs/pull/27 Commit Summary
- Added regression syntax design doc
File Changes
- A designs/0026-stan-regression-syntax.md https://github.com/stan-dev/design-docs/pull/27/files#diff-efdbb699c90c4837f62d2dbb93124ad2 (256)
Patch Links:
- https://github.com/stan-dev/design-docs/pull/27.patch
- https://github.com/stan-dev/design-docs/pull/27.diff
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stan-dev/design-docs/pull/27, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADRICBWL4WYYLHDYG2VQFRDSHZX6VANCNFSM4R3FTBEA .
So this is just new syntax for something that is already possible? While I am more familiar with the existing syntax I guess that is fine if that syntax is closer to majority of Stan users.
I have to say I do not see what is the point of the design matrix stuff. To me it just looks like forcing sparse algebra, which will probably be slower that dense version. Are there any advantages in terms of usability?
Anyway, the dense should work fine with GPUs. With kernel generator it won't be hard to implement - we can just write something very similar to the vectorized form of the model. Although the CPU implementation will be hard to optimize, due to the version of Eigen we are using not supporting indexing with vectors. On the other hand we do not have any GPU implementations for sparse operations yet.
From my point of view, this isn’t a speed thing. It’s shifting something that is currently done separately for each interface (ie a thing that is only available in RStan) into the language so that there is consistent access to it.
It also lets us build a brms-like higher lever interface for all of the versions of Stan, which should help users immensely.
A thing that is annoying is that that means the new syntax will need to consistently support factors, which is the thing that model.matrix in R does really well.
On Mon, Sep 28, 2020 at 05:23 Tadej Ciglarič [email protected] wrote:
So this is just new syntax for something that is already possible? While I am more familiar with the existing syntax I guess that is fine if that syntax is closer to majority of Stan users.
I have to say I do not see what is the point of the design matrix stuff. To me it just looks like forcing sparse algebra, which will probably be slower that dense version. Are there any advantages in terms of usability?
Anyway, the dense should work fine with GPUs. With kernel generator it won't be hard to implement - we can just write something very similar to the vectorized form of the model. Although the CPU implementation will be hard to optimize, due to the version of Eigen we are using not supporting indexing with vectors. On the other hand we do not have any GPU implementations for sparse operations yet.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stan-dev/design-docs/pull/27#issuecomment-699890033, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADRICBWCVHAGSZPTNB5OVSDSIBIYVANCNFSM4R3FTBEA .
@seantalts - thanks for the nudge!
@bbbales2, as Sean mentioned, I'm very interested in this topic and is a use case I had in mind when we were looking at the structure of the compiler. Rather than adding this as new syntax to Stan the language, the idea was to have an alternative frontend just for regression models which elaborated to core Stan or the intermediate representation we use in the compiler (i.e. we have an additional step where we recover anything that was left implicit in the terser language of regression).
There is some really interesting work by Andy Gordon on a 'regression calculus' here (which is just a fancy way of describing a core language for regression along with its meaning) which I was keen on extending to add BRMS like functionality. I spoke with Andy Gordon about this briefly at StanCon in Cambridge but I haven't had too much time to follow up of late.
There was also a presentation at StanCon of another front end language, Chronikis which was targeted at time series models and again, elaborated to Stan.
In terms of motivation, I think we have to recognize that syntax is important: it facilitates the expression of recurring patterns and allows us to restrict expressiveness. With specialized front-end we can also introduce stricter semantics allowing us to pick up errors more easily. The very fact that BRMS has a large user base is evidence of the utility of this approach.
I would be very keen to get involved with this from a language design perspective.
I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as posterior_predict and log_lik (using our R names right now) for these models directly in Stan, which is otherwise a little ugly to do in the generated quantities block.
There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in https://github.com/paul-buerkner/brms/issues/489). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot.
Interesting! I'd love to be involved in some of this discussion. Also could someone tag Bob Carpenter (I'm not sure how to do that), as he and Matt Hoffman and I were talking a lot about these issues back in 2011. Also Jonah/BenG/Aki, because this stuff would be directly relevant to our forthcoming book on applied regression and multilevel models. Andrew
On Sep 29, 2020, at 2:23 AM, Paul-Christian Bürkner [email protected] wrote:
I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as posterior_predict and log_lik (using our R names right now) for these models directly in Stan, which is otherwise a little ugly to do in the generated quantities block.
There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in paul-buerkner/brms#489 https://github.com/paul-buerkner/brms/issues/489). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/design-docs/pull/27#issuecomment-700477214, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZYCUOKMQKTJLFGLGKMPHTSIF4PVANCNFSM4R3FTBEA.
@bob-carpenter
And obviously we can add cool things like GPs (dense) and spatial models (often sparse), and smoothing splines (often sparse), and time series models (sparse). It would be ideal if we could keep as much of the brms notation as possible for these complex models...
@dpsimpson Yeah I like the y ~ gp(x) sorta thing that brms does.
Separately I would like some sort of Stan library system with a package manager. I think that's the only practical way we're going to get all the fancy design matrices and stuff that would need built for splines, approx GPs, etc in Stan models. Pumping that stuff through Stan math would be too slow development-wise. I think if we had a more formal/centralized way for people to share their code, we'd see a lot of interesting code getting shared (and avoid everyone having to reinvent things, digging in forums for code snippets, etc.).
So this is just new syntax for something that is already possible? Are there any advantages in terms of usability?
@t4c1 Yeah, having the design matrices and being able to evaluate/use them efficiently is a plus. Part of the goal is usability and part of the goal is to make it easier for the compiler/runtime to recognize that a bunch of calculations are related. A little of both.
There is some really interesting work by Andy Gordon on a 'regression calculus' another front end language, Chronikis
@enetsee The Fabular thing was interesting. Yes this is very much what I want to see in Stan. I like how they talk about breaking the regressions out into different levels. I also like the variables being explicit, but I'm probably not considering some advantage of them being implicit.
I had a look at Chronikis. I was kinda thinking about domain-specific Stan packages (alternative frontends like you're saying, and I talked to @imadmali about one). The problem with front end languages is I don't know compiler stuff and so the extent of compiling I want to do is Python format-strings sorta code generation.
The discourse thread at the top (this one) has sortof a brute-force giant function that we could use for evaluating the right hand side of a hierarchical regression. This would be something we could compile a higher level language to.
The downsides to this are it's not terribly easy to read the syntax, we don't get the design matrices themselves, and there's no hierarchy of regressions like in the regression calculus paper. I don't want to clutter the Stan language up with too much extra syntax, but I also wouldn't want to limit it to a super-conservative syntax either (cuz the intention is lots of people happily use Stan and the interface languages -- not filter everyone from Stan to interfaces).
There are obvious challenges of course but also a few perhaps not as obvious ones
@paul-buerkner Could we talk about the workflow stuff on this sometime? I am rather scared of forgetting something here. Maybe in a couple weeks after the 2.25 release (on the 19th)? @enetsee or @andrewgelman or anyone else if you want to come to this, that's great. I'll update the design doc with whatever we talk about.
Happy to have a chat about this. Just let me know here or via email.
Am So., 4. Okt. 2020 um 23:57 Uhr schrieb Ben Bales < [email protected]>:
And obviously we can add cool things like GPs (dense) and spatial models (often sparse), and smoothing splines (often sparse), and time series models (sparse). It would be ideal if we could keep as much of the brms notation as possible for these complex models...
@dpsimpson https://github.com/dpsimpson Yeah I like the y ~ gp(x) sorta thing that brms does.
Separately I would like some sort of Stan library system with a package manager. I think that's the only practical way we're going to get all the fancy design matrices and stuff that would need built for splines, approx GPs, etc in Stan models. Pumping that stuff through Stan math would be too slow development-wise. I think if we had a more formal/centralized way for people to share their code, we'd see a lot of interesting code getting shared (and avoid everyone having to reinvent things, digging in forums for code snippets, etc.).
So this is just new syntax for something that is already possible? Are there any advantages in terms of usability?
@t4c1 https://github.com/t4c1 Yeah, having the design matrices and being able to evaluate/use them efficiently is a plus. Part of the goal is usability and part of the goal is to make it easier for the compiler/runtime to recognize that a bunch of calculations are related. A little of both.
There is some really interesting work by Andy Gordon on a 'regression calculus' another front end language, Chronikis
@enetsee https://github.com/enetsee The Fabular thing was interesting. Yes this is very much what I want to see in Stan. I like how they talk about breaking the regressions out into different levels. I also like the variables being explicit, but I'm probably not considering some advantage of them being implicit.
I had a look at Chronikis. I was kinda thinking about domain-specific Stan packages (alternative frontends like you're saying, and I talked to @imadmali https://github.com/imadmali about one). The problem with front end languages is I don't know compiler stuff and so the extent of compiling I want to do is Python format-strings sorta code generation.
The discourse thread at the top (this one https://discourse.mc-stan.org/t/speeding-up-hierarchical-models/14783) has sortof a brute-force giant function that we could use for evaluating the right hand side of a hierarchical regression. This would be something we could compile a higher level language to.
The downsides to this are it's not terribly easy to read the syntax, we don't get the design matrices themselves, and there's no hierarchy of regressions like in the regression calculus paper. I don't want to clutter the Stan language up with too much extra syntax, but I also wouldn't want to limit it to a super-conservative syntax either (cuz the intention is lots of people happily use Stan and the interface languages -- not filter everyone from Stan to interfaces).
There are obvious challenges of course but also a few perhaps not as obvious ones
@paul-buerkner https://github.com/paul-buerkner Could we talk about the workflow stuff on this sometime? I am rather scared of forgetting something here. Maybe in a couple weeks after the 2.25 release (on the 19th)? @enetsee https://github.com/enetsee or @andrewgelman https://github.com/andrewgelman or anyone else if you want to come to this, that's great. I'll update the design doc with whatever we talk about.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stan-dev/design-docs/pull/27#issuecomment-703321865, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2AG5DHNRNSUB6PU4PBLSJDVUHANCNFSM4R3FTBEA .
@paul-buerkner @enetsee @andrewgelman are you free any time in the next couple weeks to talk about the not-so-obvious challenges here?
Edit: I'm available basically anytime. If there's no obvious other preferable time, before or after the Stan meeting Thursday is good.
I would be flexible on most days as well. Feel free to suggest a time.
@paul-buerkner let's plan on before the Stan meeting then. That's 2PM UTC Thursday (10AM EST) if I'm doing my math correctly.
@enetsee if you want to come and that doesn't work, pick another time. Sounds like we're both flexible. I'll e-mail a call link before then.
If you want to talk this week, would it also work after the Stan meeting?
After is good too, so let's say 4PM UTC, noon EST
Hi, just curious what happened with this?
@andrewgelman had a meeting and I never updated the design doc.
There were some things there worth writing down, but the thing that was nagging me when I was trying to update this were unresolved problems with defining parameters that might have different sizes as a function of data.
Problems like, you have an interaction:
parameters {
matrix[N_ages, N_educs] alpha0;
}
And then you have a model where in your data you don't have at least one of all the ages and educations.
I sorta got stuck here. And if you don't solve this problem you still leave a lot of annoying shuffling of stuff to the outside.
I'm not sure I talk about this problem in the design doc, but I think I am assuming that it's solved in some way, and that seems like a big assumption.