bayesplot
Adding LOO PIT vignette
Opening issue to discuss adding the following tutorial
https://tinyurl.com/loo-pit-tutorial
Thanks @ecoronado92! I've wanted to have some more intuitive demonstrations of the LOO PIT stuff for a while, so this would be great to have as a vignette. I'll try to take a look at the source at https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/loo_pit.Rmd soon and give you some comments about anything that might need to be changed for this to work as a package vignette for CRAN (and any other comments I have if that's ok). Oh and just a heads up: if I don't get to it before Saturday it might not happen until the week after next (I'm moving next week).
Thanks again for making this tutorial and I'll get back to you soon!
@jgabry no problem! I really enjoyed your paper; it has taught me so much about Bayesian model evaluation outside of PPCs. (I enjoyed it so much I wrote a couple of Towards Data Science posts about it, haha)
Thanks again, good luck with the move and Happy 4th of July!
Hey @jgabry - hope the move went well! Just wanted to follow-up to continue our conversation :)
Thanks for the proposal. The reasons why I haven't worked on a vignette yet are:
- we should first fix the density estimate, as it currently drops near the edges even though it should look uniform for actually uniform data; see issue #171
- there should be a note saying that the discrete case requires some additional work, as described in *Predictive Model Assessment for Count Data* (doesn't seem to have an open issue yet)
For 1. we are soon going to have something that doesn't have the problems of kernel density estimates and histograms, but a quick fix could also be useful. 2. has been delayed as we've been working on the better approach for 1.
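To make the boundary issue concrete, here is a minimal sketch (my own illustration, not code from issue #171) showing that `stats::density()` underestimates a uniform density near 0 and 1:

```r
# A standard Gaussian KDE of uniform data is biased downward near the
# boundaries: kernel mass leaks outside [0, 1], even though the true
# density is flat at 1 everywhere on the interval.
set.seed(123)
x <- runif(1e4)
d <- density(x)                          # stats::density, Gaussian kernel
mid  <- d$y[which.min(abs(d$x - 0.5))]   # close to the true value 1
edge <- d$y[which.min(abs(d$x))]         # roughly halved at the boundary
c(mid = mid, edge = edge)
```

This halving at the edges is exactly the artifact that makes a perfectly calibrated LOO PIT plot look non-uniform.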
@ecoronado92 are you interested in a) fixing the density estimate using a kernel density estimate with boundary conditions, or b) adding the nonrandomized uniform version of the PIT histogram as described in the reference mentioned above? I can make a separate issue.
@avehtari No worries, thanks for getting back to me. Ah I see, so these two things would need to be updated in the demo I made so it can be included as a package vignette?
I'm happy to address 1. with a quick fix using the packages mentioned in #171 and then we can move forward on the other points. Since my demo doesn't use the ppc_loo_pit_overlay() function and instead builds the empirical LOO PIT from scratch, should I add these changes to my demo or open a pull request?
Let me know if that works.
Hmmm... There's a problem: we'd rather not add more dependencies on other packages (even though I suggested a couple of packages), so we should carefully consider what to do with this. @jgabry what do you think?
There is also an alternative ECDF and ECDF-diff based approach which would be better than density estimates and we should soon have code public for that.
Since my demo doesn't use the ppc_loo_pit_overlay() function and instead builds the empirical LOO PIT from scratch, should I add these changes to my demo or open a pull request?
For a vignette to be included in bayesplot package, it would be good to have ppc_loo_pit_overlay() used.
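For reference, the ECDF-difference idea mentioned above can be sketched roughly as follows (my own illustration, not the forthcoming implementation): for calibrated predictions the PIT values are uniform, so their ECDF minus the identity line should hover around zero.

```r
# Rough sketch of an ECDF-difference check for PIT values: compare the
# empirical CDF of the PIT values against the uniform CDF. For uniform
# (well-calibrated) PIT values the difference stays close to zero.
set.seed(42)
pit <- runif(500)                  # stand-in for actual LOO PIT values
xx  <- seq(0, 1, length.out = 101)
ecdf_diff <- ecdf(pit)(xx) - xx    # ECDF minus uniform CDF
max(abs(ecdf_diff))                # small when pit is uniform
```

Unlike a KDE, the ECDF needs no bandwidth and has no boundary bias, which is why it sidesteps the density-estimate problems discussed above.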
@avehtari Ok, that makes sense... I can add ppc_loo_pit_overlay(), but I'll wait to see what we decide on the ECDF part before starting to refactor the current demo.
Sorry I haven't chimed in yet! After I moved we also ended up having some issues with the latest RStan release and I've been spending a lot of time responding to people with installation issues.
There's a problem: we'd rather not add more dependencies on other packages (even though I suggested a couple of packages), so we should carefully consider what to do with this. @jgabry what do you think?
Yeah it'd be better to have our own implementation that fixes the problem with the density plots. I'd rather not add another dependency just to call one function. However, we could add it to Suggests instead of Imports and then it wouldn't be required to have it when loading bayesplot, just for using the relevant function.
@ecoronado92 Do you want to take a look at those other packages and check if (a) it would be difficult to do our own implementation of the better density estimates or, if that seems like too much effort, then (b) if you can use one of those packages to fix the problem, in which case we'll add it as a suggested package?
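If option (b) ends up being the route, the usual pattern for a package in Suggests is to guard the call with `requireNamespace()` at call time rather than importing. A sketch, assuming `evmix` is the suggested package (the wrapper name `bc_density` is hypothetical):

```r
# Sketch of the standard Suggests pattern: the dependency is only
# required when the user actually calls the function that needs it.
bc_density <- function(x, ...) {
  if (!requireNamespace("evmix", quietly = TRUE)) {
    stop("Please install the 'evmix' package to use boundary-corrected densities.",
         call. = FALSE)
  }
  evmix::dbckden(x, ...)
}
```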
There is also an alternative ECDF and ECDF-diff based approach which would be better than density estimates and we should soon have code public for that.
@avehtari It would be great to have this. Is one of your students working on this?
If we can fix the problem with the density estimation then I think it's probably worth adding this vignette and then updating it when the ECDF and ECDF-diff stuff is ready. Otherwise we can wait until we have that.
What do you both think?
@avehtari It would be great to have this. Is one of your students working on this?
Yes. It will take some time to have it in a vignette since it's not just a new plot, so we need a proper description of the new parts and experiments to demonstrate the behavior.
@ecoronado92 Do you want to take a look at those other packages and check if (a) it would be difficult to do our own implementation of the better density estimates or, if that seems like too much effort, then (b) if you can use one of those packages to fix the problem, in which case we'll add it as a suggested package?
@jgabry Yes, I can look into those implementations in the packages suggested in #171 and see how difficult it could be to implement it from scratch. Hopefully it's straightforward.
How should I proceed in either case: a pull request against the script for ppc_loo_pit_overlay(), or would you prefer I start by tinkering with my demo linked above?
For a vignette to be included in bayesplot package, it would be good to have ppc_loo_pit_overlay() used.
@avehtari Apologies, I misspoke. I do use ppc_loo_pit_overlay() in the demo as the ground truth. The demo approximates the ppc_loo... plot by computing the same empirical CDF p-values using brute-force LOO refitting instead of PSIS, and uses animations to show this process to build some intuition.
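For readers following along, the brute-force idea described above can be sketched like this (my own illustration, not the demo's exact code; `fit_fun` and `predict_fun` are hypothetical placeholders for the model fit and posterior-predictive draw steps):

```r
# Brute-force LOO PIT: for each observation i, refit the model without i
# and compute Pr(y_rep_i <= y_i) under that fit. PSIS-LOO approximates
# the same quantity without the n expensive refits.
brute_force_loo_pit <- function(data, fit_fun, predict_fun) {
  n <- nrow(data)
  vapply(seq_len(n), function(i) {
    fit_i  <- fit_fun(data[-i, , drop = FALSE])            # refit without obs i
    yrep_i <- predict_fun(fit_i, data[i, , drop = FALSE])  # draws of y_rep_i
    mean(yrep_i <= data$y[i])                              # empirical PIT value
  }, numeric(1))
}
```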
Ok great thanks!
How should I proceed in either case: a pull request against the script for ppc_loo_pit_overlay(), or would you prefer I start by tinkering with my demo linked above?
Whichever you prefer is fine by me. Presumably if you do it by tinkering with your demo and you get it working, then we could apply the same approach in the ppc_loo_pit_overlay() code.
@jgabry I ran some quick simulations and tested one of the packages. I wanted to run these by you before diving into more coding. Is this what we're trying to achieve?
Below is the code I used to generate a small random uniform sample and some histograms with KDEs in blue (stats density()) and red (boundary-corrected).
```r
library(evmix)
set.seed(15)
n <- 100
x <- runif(n)
xx <- seq(-0.1, 1.1, 0.001)
hist(x, freq = FALSE, col = rgb(0.2, 0.2, 0.2, 0.2))
lines(xx, dbckden(xx, x, lambda = bw.nrd0(x) / 2, xmax = 1,
                  bcmethod = "beta2", kernel = "gaussian"),
      lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "blue")
```
![Screen Shot 2020-08-05 at 4 01 24 PM](https://user-images.githubusercontent.com/42551278/89458390-f4968d80-d734-11ea-8c04-3a40b598ea40.png)
Cool thanks! Yeah I think the red curve looks much better. @avehtari is this what you had in mind for the corrected density estimation?
Can you try with a larger n? Preferably the same n as used in bayesplot to generate the bunch of reference density lines.
@avehtari Let me know if this looks like what you're looking for. This is still using the evmix package function, but I can dig deeper and see if we can build our own method. I used the radon example from the current LOO PIT reference page. Code below (I will refactor it if this seems like the right path forward).
![Screen Shot 2020-08-06 at 11 36 43 AM](https://user-images.githubusercontent.com/42551278/89551629-26f9c680-d7d9-11ea-9e22-87b849accbb6.png)
```r
library(rstanarm)
library(loo)
library(tidyverse)
library(evmix)

# Radon example ------
data(radon)
fit <- stan_lmer(
  log_radon ~ floor + log_uranium + floor:log_uranium + (1 + floor | county),
  data = radon,
  iter = 1000,
  chains = 2,
  cores = 2
)
y <- radon$log_radon
yrep <- posterior_predict(fit)
loo1 <- loo(fit, save_psis = TRUE, cores = 2)
psis1 <- loo1$psis_object
lw <- weights(psis1)

# Boundary correction ------
pit <- rstantools::loo_pit(object = yrep, y = y, lw = lw)  # compute pit before using its length
xx <- seq(from = 0, to = 1, length.out = length(pit))
samples <- 100 # same as bayesplot
bcpit <- dbckden(xx, pit, lambda = bw.nrd0(pit) / 2, xmax = 1,
                 bcmethod = "beta2", kernel = "gaussian")
unifs <- matrix(runif(length(pit) * samples), nrow = samples)
bcunif <- apply(unifs, 1, function(x) {
  dbckden(xx, x, lambda = bw.nrd0(x) / 2, xmax = 1,
          bcmethod = "beta2", kernel = "gaussian")
})

# Plotting LOO PIT ------
data <- ppc_data(bcpit, t(bcunif)) %>%
  arrange(rep_id) %>%
  mutate(xx = rep(xx, times = samples + 1))

ggplot(data) +
  aes_(x = ~xx, y = ~value) +
  geom_line(aes_(group = ~rep_id, color = "yrep"),
            data = function(x) dplyr::filter(x, !.data$is_y),
            alpha = 0.7, size = 0.25, na.rm = TRUE) +
  geom_line(aes_(color = "y"),
            data = function(x) dplyr::filter(x, .data$is_y),
            size = 1, na.rm = TRUE) +
  scale_x_continuous(
    limits = c(0, 1),
    expand = expansion(0, 0),
    breaks = seq(from = 0.1, to = 0.9, by = 0.2)
  ) +
  scale_y_continuous(
    limits = c(0, NA),
    expand = expansion(mult = c(0, 0.25))
  ) +
  bayesplot_theme_get() +
  yaxis_title(FALSE) +
  xaxis_title(FALSE) +
  yaxis_text(FALSE) +
  yaxis_ticks(FALSE)
```
@avehtari Let me know if this looks like what you're looking for.
Yes! Now the average of the yrep lines is constant, even if the variation is slightly bigger near the edges (but then we'll have the same variation also for y, so this is what we want).
Cool! I'll dig into the method and will keep you updated
Awesome thanks!
@avehtari @jgabry I was able to build some simple methods I think could be added to the ppc-checks.R script.
I've provided an example after tinkering with my demo. https://htmlpreview.github.io/?https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/loo_pit.html
The methods code can be found here. I structured it so that there are two main functions: one to apply the boundary correction (BC) to the PIT values and another to generate BC runif samples (as in the original code). One caveat right now is that it can be slow (e.g. it takes around 25 secs to generate the 100 BC runifs).
I've included a proposed ppc_loo_pit_overlay2() function to show how I would include these in the function (see the "Boundary corrected PPC LOO PIT OVERLAY custom function" section).
https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/bc_density.R
Let me know what you think
Cool, thanks for this! Sorry it took me a few days to get back to you. I think what you've done looks really good. What do you think @avehtari?
One caveat right now is that it can be slow (e.g. it takes around 25 secs to generate the 100 BC runifs)
Wow, yeah that's slow. Presumably that's coming from the bc_pvals() function in the code you shared (and/or the bc_dunif() function it calls)? Do you know where the bottleneck is? A profiling tool like profvis might be helpful for figuring out which line of code is contributing the most to the run time. If we can isolate where the issue is maybe we can figure out a way to speed it up. If we can't (i.e., if the algorithm for boundary correction is just super slow by necessity) then I guess we can live with that (it still seems useful).
@jgabry
Do you know where the bottleneck is?
The bottleneck comes from generating the BC runifs, mostly because it has two loops: one over each runif and another over each element of that runif. The length of these runifs depends on the sample size at hand (defined by xs <- seq(0, 1, length.out = length(x)), where x are the LOO PIT values). So, for example, for a sample with N = 600 there are 600 * 100 = 60,000 iterations happening.
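To illustrate the shape of that computation, the slow version looks roughly like this (my own sketch of the structure described above, not the actual bc_pvals()/bc_dunif() code):

```r
# Double loop over samples and grid points: the work grows as
# samples * length(x), which is what makes this version slow.
bc_runif_densities_slow <- function(x, samples = 100) {
  xs <- seq(0, 1, length.out = length(x))
  out <- matrix(NA_real_, nrow = samples, ncol = length(xs))
  for (s in seq_len(samples)) {     # loop 1: one per simulated uniform
    u <- runif(length(x))
    for (j in seq_along(xs)) {      # loop 2: one per grid point
      out[s, j] <- evmix::dbckden(xs[j], u, lambda = bw.nrd0(u) / 2,
                                  xmax = 1, bcmethod = "beta2",
                                  kernel = "gaussian")
    }
  }
  out
}
```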
I've been playing around with some dplyr and reshape2 functions (just so it's consistent with the package's current dependencies) to see if I can speed this up without loops using case_when and melt, but I haven't been able to get it to work quite yet. I also looked into the package's rdbckde function, and it is even slower since the random sampling is done via the inverse CDF.
@avehtari @jgabry I hit a dead end using dplyr, tidyr and reshape2 without generating very large data frames with repeated data, which can consume lots of memory and become messy. Two potential solutions I can think of to speed things up with the current setup: 1) parallelize with foreach and doParallel, or 2) port the current functions to Rcpp/Eigen. I'd think the second option is preferred since the package already suggests rstan, which imports Rcpp.
The bottleneck comes from generating the BC runifs, mostly because it has two loops: one over each runif and another over each element of that runif. The length of these runifs depends on the sample size at hand (defined by xs <- seq(0, 1, length.out = length(x)), where x are the LOO PIT values). So, for example, for a sample with N = 600 there are 600 * 100 = 60,000 iterations happening.
Thanks, that makes sense.
Two potential solutions I can think of to speed things up with the current setup: 1) parallelize with foreach and doParallel, or 2) port the current functions to Rcpp/Eigen. I'd think the second option is preferred since the package already suggests rstan, which imports Rcpp.
Yeah right now I can't think of any other good options. But I think as a first step it makes sense to add an argument to the existing function that uses the slow boundary corrected version using just R code. We can leave the current version as the default and document that turning on boundary correction is preferable but slow. Once we have that we can then figure out the best way to speed it up and eventually if it's fast enough we can just make it the default (and probably remove the old way). But this way we at least have the option of using the boundary corrected version in the meantime while we figure out a better way.
What do you think?
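A sketch of what that interface could look like (the argument name `boundary_correction` is my guess at the proposal, and the function body is elided to comments):

```r
# Proposed interface: keep the current behavior as the default and
# expose the slow boundary-corrected version behind a flag.
ppc_loo_pit_overlay <- function(y, yrep, lw, ..., boundary_correction = FALSE) {
  if (boundary_correction) {
    # slow, pure-R boundary-corrected KDE path (documented as preferable
    # but slow until a faster implementation lands)
  } else {
    # existing default density path
  }
}
```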
Yeah right now I can't think of any other good options. But I think as a first step it makes sense to add an argument to the existing function that uses the slow boundary corrected version using just R code. We can leave the current version as the default and document that turning on boundary correction is preferable but slow. Once we have that we can then figure out the best way to speed it up and eventually if it's fast enough we can just make it the default (and probably remove the old way). But this way we at least have the option of using the boundary corrected version in the meantime while we figure out a better way.
What do you think?
@jgabry I think you're right and this approach is the least disruptive while adding this feature.
I'll create a branch and set up a pull request. Would you mind giving me permission to set up a branch and push changes there?
I'll set up the bc_dunif, bc_pvals and generate_bc_runifs functions under the # internal ---- section in ppc-loo.R to keep things tidy and emulate your current coding style.
I think you don't need that double loop, but you can just use directly those runif values for the density estimation for the reference lines.
I think you don't need that double loop, but you can just use directly those runif values for the density estimation for the reference lines.
@avehtari I think I see what you're saying, let me give it a try.
@avehtari You were right, I removed one loop and it now takes ~10 secs for 600 obs and ~20 secs for 1k obs. Still not optimal, but better (for context, the ~25 sec time I mentioned earlier was for 600 obs).
I can't think of anything else off the top of my head on how to speed it up further without C++ or parallelization.
The main bottleneck is the vapply in the updated method in the link below (@jgabry I did some profiling and this seems to take ~70% of the overall compute time):
https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/bc_density.R
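For context, the single-loop version might look roughly like this (my own sketch of the idea, not the actual updated code): since dbckden() is vectorized over its first argument, each simulated uniform needs only one call over the whole grid, leaving a single vapply() over samples.

```r
# Single loop over the simulated uniforms; the per-grid-point inner loop
# is replaced by one vectorized dbckden() call per sample.
bc_runif_densities_fast <- function(x, samples = 100) {
  xs <- seq(0, 1, length.out = length(x))
  t(vapply(seq_len(samples), function(s) {
    u <- runif(length(x))
    evmix::dbckden(xs, u, lambda = bw.nrd0(u) / 2, xmax = 1,
                   bcmethod = "beta2", kernel = "gaussian")
  }, numeric(length(xs))))
}
```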
You were right, I removed one loop and it now takes ~10 secs for 600 obs
Good
I can't think of anything else top of mind on how to speed it further without C++ or parallelization.
Based on a quick look at the code, I agree with this. Too bad that R is so slow. Anyway, it's great to have this even if it's slow, and we'll soon(ish) have something faster based on ECDFs.