ggdist
Stan distribution functions
I love the new stat_dist_halfeye() function. However, I think the current approach of matching Stan distribution names to existing R functions (see below) may not be ideal. In some cases, the R function differs from the Stan function in argument names, argument order, and even parameterization.
https://github.com/mjskay/tidybayes/blob/ed254b9920e8407536f3f2e5f7f4b034e85a0df3/R/parse_dist.R#L150-L167
To address this issue, I have been creating an R function for each Stan distribution function that remains true to the Stan function. See my progress at github.com/jmgirard/standist.
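One concrete case of the mismatch: Stan's student_t(nu, mu, sigma) is a location-scale family, while base R's dt(x, df, ncp) takes a noncentrality parameter in the second position. The following is only a minimal sketch in base R; the helper name dstudent_t_stan is hypothetical and is not taken from standist or tidybayes.

```r
# Hypothetical helper (illustrative name, not a standist/tidybayes function):
# density of Stan's student_t(nu, mu, sigma), built from base R's standard t.
dstudent_t_stan <- function(x, nu, mu = 0, sigma = 1) {
  stats::dt((x - mu) / sigma, df = nu) / sigma
}

x <- 2

# Stan-style call: 3 degrees of freedom, location 1, scale 10.
stan_density <- dstudent_t_stan(x, nu = 3, mu = 1, sigma = 10)

# Passing Stan's first two arguments positionally to dt() treats mu as ncp
# (noncentrality), which is a different distribution; dt() has no scale at all.
naive_density <- stats::dt(x, 3, 1)
```

The two values differ, which is why a name-only lookup from student_t to dt would silently plot the wrong prior.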
I wonder if you might consider using these functions in the lookup table above (similar to what is already being done for brms::student_t()), or permitting me to do so via PR. If you are interested, I would be open to either importing from standist or rolling this project into tidybayes.
On a related note, in that same standist repo linked above, I am working on easy single-line function calls to visualize Stan distributions (for pedagogical purposes and for selecting priors). Something like viz("student_t(3, 0, 10)") or viz("student_t", arg1 = seq(1, 10, 2), arg2 = 0, arg3 = 1). I have a basic implementation of this using ggplot2::stat_function(), but I think it'd be more robust and flexible to use your tidybayes::stat_dist_halfeye() function. Any interest in collaborating on this?
Yeah this sounds like a great idea! Would love a PR to make this work... Would need to be careful to avoid circular dependencies if you want to build standist::viz() on top of tidybayes.
One option might be to come up with a more formal way of registering distribution mappings or of choosing desired mappings. The current approach is very ad hoc and could (should) be better. This would also help deal with the fact that if someone is using tidybayes with JAGS, say, they would want a different parameterization than Stan's; or they might want a base R parameterization instead of a Stan one.
So, yeah, happy to collaborate!
I hadn't considered that some users might want the JAGS or R parameterizations. In this case, making it modular/customizable seems like the way to go. Perhaps parse_dist() could be made to guess which parameterization was requested and also include an explicit argument for it.
Currently the approach is that if to_r_names = TRUE then it uses the lookup table on normalized names and otherwise does nothing. A more generic approach would be to either allow people to swap lookup tables or give an arbitrary lookup function. The lookup table approach is probably simpler to extend except in cases where people want a more complex lookup function, though I'm struggling a bit to determine what that might look like. So one way to adjust things would be to add a parameter to r_dist_name() that specifies the lookup table, add that same parameter to parse_dist() and pass it through to r_dist_name(), give it a sensible default, and then have standist export an object containing its own lookup table that could be passed to that parameter. Thoughts?
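A rough sketch of what a swappable lookup table could look like, in base R only. All names here (default_lookup, lookup_dist_name, custom_lookup) are hypothetical, and the normalization rule (lower-case, strip underscores) is an assumption rather than what r_dist_name() actually does.

```r
# Hypothetical default table: normalized distribution name -> R family name.
default_lookup <- c(normal = "norm", lognormal = "lnorm", studentt = "student_t")

# Normalize each name (assumed rule: lower-case, drop underscores), then map it
# through the supplied table; names not in the table are returned unchanged.
lookup_dist_name <- function(dist_name, lookup = default_lookup) {
  key <- gsub("_", "", tolower(dist_name), fixed = TRUE)
  mapped <- unname(lookup[key])
  ifelse(is.na(mapped), dist_name, mapped)
}

lookup_dist_name(c("Normal", "student_t", "foo"))
# "norm" "student_t" "foo"

# Another package could export its own table to be passed in instead.
custom_lookup <- c(normal = "stan_normal")  # placeholder entry
lookup_dist_name("normal", lookup = custom_lookup)
# "stan_normal"
```

Because the table is just a named character vector, a caller can combine or override entries with ordinary vector operations before passing it in.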
I wonder how many total lookup tables will be necessary. I could see one for R, one for Stan, one for JAGS. Any others? Then an argument could point to one of these tables explicitly as you said or, in the absence of an explicit argument, perhaps we could search all tables for a matching name and return a warning/error if the number of matches does not equal 1.
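The "search every table and complain unless there is exactly one match" idea could be sketched as follows. The table contents are placeholders for illustration, and resolve_dist is a hypothetical name, not an existing function.

```r
# Hypothetical tables; entries are placeholders, not real mappings.
dist_tables <- list(
  r    = c(norm = "norm", gamma = "gamma"),
  stan = c(normal = "norm", gamma = "gamma"),
  jags = c(dnorm = "jags_norm")
)

# Resolve a name against all tables, or against one table when `table` is
# given explicitly; error unless exactly one table contains the name.
resolve_dist <- function(name, tables = dist_tables, table = NULL) {
  if (!is.null(table)) tables <- tables[table]
  hits <- Filter(function(tbl) name %in% names(tbl), tables)
  if (length(hits) != 1) {
    stop("Expected exactly 1 table matching '", name, "', found ",
         length(hits), call. = FALSE)
  }
  unname(hits[[1]][name])
}

resolve_dist("normal")                 # found only in the stan table
resolve_dist("gamma", table = "stan")  # ambiguous unless a table is named
```

Calling resolve_dist("gamma") without the table argument raises an error here, since the name appears in two tables.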
Yeah, there's a heavier-weight and lighter-weight variant:
- Lookup tables are not "registered" in any way; any named character vector could be used as a lookup table. So some objects might be provided with names like stan_dists or r_dists or what have you that can be passed in as an argument to parse_dist. Pros: simple interface. Cons: could make life complicated if other packages have no way to "register" distributions to one of these lists (or would mean users have to manually combine lists in some cases). On the other hand, that might be a problem that only exists in theory (if no other packages adopt this functionality the problem never arises, so spending engineering effort solving it is a waste).
- Lookup tables are "registered" through some function (and maybe also have a mechanism for packages to register functions onto specific lookup tables or something). This could make usage simpler for many common use cases (e.g. allowing users to just pass a list name). It does make things a bit more "magical" from an api-understanding perspective in that there is a lookup-table registration mechanism in the background that users aren't exposed to and might find harder to debug when things go wrong.
My guess is that there will only ever be a handful of tables needed and we can probably anticipate what those will be and provide them in the package. Then we could have a small vignette or help doc of how to add a custom table for advanced users. So maybe "registration" is not needed.
Hmmm okay, I'm trying to think of a practical solution here that doesn't introduce a bunch more dependencies into tidybayes (which is already reasonably heavy on the deps department) and which wouldn't introduce a circular dependency with standist, while allowing us to also provide good error messages (as currently parse_dist errors are decidedly not good).
One could imagine lookup tables like:
list(
  normal = "stats::norm",
  lognormal = "stats::lnorm",
  studentt = "brms::student_t"
)
Where parse_dist could then parse those specs into package and function family name and throw an error if the necessary package is not available during parsing (and maybe a warning if it's not loaded...). Or perhaps it would go looking for the corresponding functions and if they aren't found, check for the specified package. Another format might be more explicit, like:
list(
  normal = list("norm", package = "stats"),
  lognormal = list("lnorm", package = "stats"),
  studentt = list("student_t", package = "brms")
)
Where the use of the list() format could be optional (e.g. if a character vector of length one is provided, just search for that distribution in the loaded packages). This would also allow future extensions to search multiple packages if that's ever needed (e.g. by providing a vector to package=).
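To make the "pkg::family" spec idea concrete, here is a minimal base R sketch of splitting a spec and failing early with a readable message when the package is missing. Both helper names (parse_dist_spec, density_fun) are hypothetical and not part of tidybayes.

```r
# Hypothetical helper: split "pkg::family" into its parts and check that the
# package is installed. A bare "family" is returned with package = NULL.
parse_dist_spec <- function(spec) {
  parts <- strsplit(spec, "::", fixed = TRUE)[[1]]
  if (length(parts) == 2) {
    pkg <- parts[1]
    family <- parts[2]
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Distribution '", family, "' requires package '", pkg,
           "', which is not installed.", call. = FALSE)
    }
  } else {
    pkg <- NULL
    family <- spec
  }
  list(family = family, package = pkg)
}

# Hypothetical helper: fetch the density function for a parsed spec using the
# usual d<family> naming convention.
density_fun <- function(parsed) {
  fun_name <- paste0("d", parsed$family)
  if (is.null(parsed$package)) {
    match.fun(fun_name)
  } else {
    getExportedValue(parsed$package, fun_name)
  }
}

spec <- parse_dist_spec("stats::norm")
density_fun(spec)(0)  # standard normal density at 0
```

Checking with requireNamespace() at parse time means the error names the missing package, instead of surfacing later as a "could not find function" failure during plotting.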
Then some preconfigured lists could be provided, like:
r_distributions = list(...)
stan_distributions = list(...)
jags_distributions = list(...)
all_distributions = modifyList(
modifyList(jags_distributions, stan_distributions), r_distributions
)
And the default would be all_distributions. I have some mixed feelings about dropping these variable names into the package environment, might want to think about that a bit...
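For reference, the nesting order in the modifyList() call determines which table wins on a name clash: entries in the second argument replace those in the first. A small sketch with placeholder entries (the JAGS entry in particular is made up for illustration):

```r
# Placeholder tables for illustration only.
r_distributions    <- list(norm = "stats::norm")
stan_distributions <- list(normal = "stats::norm", studentt = "brms::student_t")
jags_distributions <- list(studentt = "placeholder::jags_t")  # made-up entry

# Stan entries override JAGS entries; R entries override both.
all_distributions <- modifyList(
  modifyList(jags_distributions, stan_distributions), r_distributions
)

all_distributions$studentt
# "brms::student_t"
```

So with this ordering, R names take precedence over Stan names, which in turn take precedence over JAGS names.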
I'm going to follow your lead here, as this is your package and you're more experienced in package development than I am. I'm happy to provide a second opinion whenever helpful, but my plan is to let you decide how you want this stuff formatted. Then I can take the lead in gathering the functions up and creating the lists. I am not so familiar with JAGS so I would likely begin with implementing this for the R and Stan functions.
This does seem like a good candidate for inclusion in ggdist. Let me know when you have time to return to it.
Yeah, there's a couple of things to consider here in light of #14.
I suspect long-term my plan is to rethink parse_dist() a bit and have it output distribution vectors a la #14 rather than the string-plus-args approach. If https://github.com/mitchelloharawild/distributional/issues/30 is implemented then I could imagine using it to wrap arbitrary Stan distributions from brms priors and outputting those.
On the other hand, it may then be the case that ggdist isn't the right place for the solution to this problem. Another way forward would be to petition @paul-buerkner to put a function in brms that outputs {distributional} vectors of brms priors, possibly using the {standist} package. Then those vectors would automatically be supported by the upcoming version of {ggdist} without having to use ggdist::parse_dist() at all (the stat_dist_... geoms on the dev branch of ggdist can already plot such objects). Would be curious what @paul-buerkner thinks about something like that?
I have not yet looked at this in detail. Would you mind opening an issue on the brms issue tracker (https://github.com/paul-buerkner/brms/issues)?
The {distributional} package can now wrap arbitrary distributions (https://github.com/mitchelloharawild/distributional/issues/30). With the recent addition of {distributional} support in {ggdist}, it should now be possible to plot arbitrary distributions:
library(ggplot2)
library(dplyr)
library(ggdist)
library(brms)
library(distributional)
tribble(
~ group, ~ dist,
"Normal", dist_normal(mu = 0, sigma = 1),
"Skewed", dist_wrap("skew_normal", package = "brms", mu = 0, sigma = 1, alpha = 10)
) %>%
ggplot(aes(x = group, dist = dist, fill = group)) +
stat_dist_eye(position = "dodge")
Created on 2020-07-14 by the reprex package (v0.3.0)
I think this problem is basically solved from the {ggdist} perspective: parse_dist() has (for a while now) supported arbitrary distributions via distributional::dist_wrap(), and passes down the package argument. Thus, if someone wants to support an arbitrary set of distributions defined in some other package or environment, they can pass that package or environment to parse_dist(), and it should then find the distributions and set up the .dist_obj object correctly.