bayesplot Adding LOO PIT vignette

Adding LOO PIT vignette

Open ecoronado92 opened this issue 4 years ago • 30 comments

Opening issue to discuss adding the following tutorial

https://tinyurl.com/loo-pit-tutorial

Jul 02 '20 22:07 ecoronado92

Thanks @ecoronado92! I've wanted to have some more intuitive demonstrations of the LOO PIT stuff for a while, so this would be great to have as a vignette. I'll try to take a look at the source at https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/loo_pit.Rmd soon and give you some comments about anything that might need to be changed for this to work as a package vignette for CRAN (and any other comments I have if that's ok). Oh and just a heads up: if I don't get to it before Saturday it might not happen until the week after next (I'm moving next week).

Thanks again for making this tutorial and I'll get back to you soon!

Jul 02 '20 22:07 jgabry

@jgabry no problem! I really enjoyed your paper, it has taught me so much about Bayesian models evaluation outside of PPCs. (I enjoyed it so much I wrote a couple Towards Data Science posts about it haha)

Thanks again, good luck with the move and Happy 4th of July!

Jul 02 '20 23:07 ecoronado92

Hey @jgabry - hope the move went well! Just wanted to follow-up to continue our conversation :)

Jul 24 '20 02:07 ecoronado92

Thanks for the proposal. The reasons for me why I haven't worked on vignette are

we should fix first the density estimate as it now is dropping near edges while it should look uniform for actually uniform data, see isssue #171
there should be a note saying that discrete case requires some additional work Predictive Model Assessment for Count Data (doesn't seem to have open issue yet)

For 1. we are going to soon have also something that doesn't have the problems of kernel density estimate and histogram, but quick fix could also be useful. 2. has been delayed as we've been working on better approach for 1.

@ecoronado92 are you interested in a) fixing the density estimate using a kernel density estimate with boundary conditions, or b) adding nonrandomized uniform version of the PIT histogram as described in the reference mentioned above? I can make a separate issue.

Jul 29 '20 12:07 avehtari

@avehtari No worries, thanks for getting back to me. Ah I see, so these two things would need to be updated in the demo I made so it can be included as a package vignette?

I'm happy to address 1. with a quick fix using the packages mentioned in #171 and then we can move forward on the other points. Since my demo doesn't use the ppc_loo_pit_overlay() function and instead builds the empirical LOO PIT from scratch, should I add these changes to my demo or open a pull request?

Let me know if that works.

Jul 29 '20 16:07 ecoronado92

Hmmm... There's a problem that we wouldn't like to add more dependencies to other packages (even that I suggested couple packages), so we should carefully consider what to do with this. @jgabry what do you think?

There is also an alternative ECDF and ECDF-diff based approach which would be better than density estimates and we should soon have code public for that.

Since my demo doesn't use the ppc_loo_pit_overlay() function and instead builds the empirical LOO PIT from scratch, should I add these changes to my demo or open a pull request?

For a vignette to be included in bayesplot package, it would be good to have ppc_loo_pit_overlay() used.

Jul 30 '20 11:07 avehtari

@avehtari Ok, that makes sense...I can add ppc_loo_pit_overlay() but will wait to see what we decide on the ECDF part to start refactoring the current demo

Jul 30 '20 12:07 ecoronado92

Sorry I haven't chimed in yet! After I moved we also ended up having some issues with the latest RStan release and I've been spending a lot of time responding to people with installation issues.

There's a problem that we wouldn't like to add more dependencies to other packages (even that I suggested couple packages), so we should carefully consider what to do with this. @jgabry what do you think?

Yeah it'd be better to have our own implementation that fixes the problem with the density plots. I'd rather not add another dependency just to call one function. However, we could add it to Suggests instead of Imports and then it wouldn't be required to have it when loading bayesplot, just for using the relevant function.

@ecoronado92 Do you want to take a look at those other packages and check if (a) it would be difficult to do our own implementation of the better density estimates or, if that seems like too much effort, then (b) if you can use one of those packages to fix the problem, in which case we'll add it as a suggested package?

There is also an alternative ECDF and ECDF-diff based approach which would be better than density estimates and we should soon have code public for that.

@avehtari It would be great to have this. Is one of your students working on this?

If we can fix the problem with the density estimation then I think it's probably worth adding this vignette and then updating it when the ECDF and ECDF-diff stuff is ready. Otherwise we can wait until we have that.

What do you both think?

Aug 03 '20 17:08 jgabry

@avehtari It would be great to have this. Is one of your students working on this?

Yes. It takes some time to have it in vignette as it's not just a new plot, so we need to have proper description of the new parts and experiments to demonstrate that the behavior.

Aug 03 '20 18:08 avehtari

@ecoronado92 Do you want to take a look at those other packages and check if (a) it would be difficult to do our own implementation of the better density estimates or, if that seems like too much effort, then (b) if you can use one of those packages to fix the problem, in which case we'll add it as a suggested package?

@jgabry Yes, I can look into those implementations in the packages suggested in #171 and see how difficult it could be to implement it from scratch. Hopefully it's straightforward.

How should I proceed in either case - a pull request in the script for ppc_loo_pit_overlay or would you prefer I start tinkering my demo linked above?

For a vignette to be included in bayesplot package, it would be good to have ppc_loo_pit_overlay() used.

@avehtari Apologies, I mispoke. I do use the ppc_loo_pit_overlay() in the demo as the ground truth. The demo approximates the ppc_loo... plot by computing the same empirical CDF pvals using a brute-force loo refitting instead of PSIS, and uses animations so show this process to build some intuition.

Aug 04 '20 15:08 ecoronado92

Ok great thanks!

How should I proceed in either case - a pull request in the script for ppc_loo_pit_overlay or would you prefer I start tinkering my demo linked above?

Whichever you prefer is fine by me. Presumably if you do it by tinkering with your demo and you get it working then we could apply the same approach in the ppc_loo_pit_overlay() code.

Aug 05 '20 18:08 jgabry

@jgabry I ran some quick simulations and tested one of the packages. I wanted to run these by you before diving into more coding. Is this what we're trying to achieve?

Below is the code I used to generate a small random unif sample and some histograms with KDEs in blue (stats density()) and red (boundary-corrected)

library(evmix)
set.seed(15)
n=100 
x = runif(n)
xx = seq(-0.1, 1.1, 0.001)
hist(x, freq=FALSE, col = rgb(0.2,0.2,0.2,0.2))
lines(xx, dbckden(xx, x, lambda=bw.nrd0(x)/2,  xmax=1, bcmethod = "beta2", kernel = "gaussian"), 
      lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "blue")

Aug 05 '20 20:08 ecoronado92

Cool thanks! Yeah I think the red curve looks much better. @avehtari is this what you had in mind for the corrected density estimation?

Aug 05 '20 22:08 jgabry

Can you try with larger n? Preferably the same n a used in bayesplot to generate the bunch of reference density lines

Aug 06 '20 06:08 avehtari

@avehtari Let me know if this looks like what you're looking for. This is still using the evmix package function but can dig deeper and see if we can build our own method

I used the radon example from the current loo pit reference page. Code below (will refactor it if this seems the right path to move forward)

library(rstanarm)
library(loo)
library(tidyverse)
library(evmix)

# Radon example ------
data(radon)

fit <- stan_lmer(
  log_radon ~ floor + log_uranium + floor:log_uranium
  + (1 + floor | county),
  data = radon,
  iter = 1000,
  chains = 2  ,cores = 2
)

y <- radon$log_radon
yrep <- posterior_predict(fit)

loo1 <- loo(fit, save_psis = TRUE, cores = 2)
psis1 <- loo1$psis_object
lw <- weights(psis1)

# Boundary correction ------

xx <- seq(from = -0, to = 1, length.out = length(pit))
samples <- 100 # same as bayesplot

pit <- rstantools::loo_pit(object = yrep, y = y, lw = lw)
bcpit <- dbckden(xx, pit, lambda = bw.nrd0(pit)/2,  xmax=1, bcmethod = "beta2", kernel = "gaussian" )

unifs <- matrix(runif(length(pit) * samples), nrow = samples)
bcunif <- apply(unifs, 1, function(x) dbckden(xx, x, lambda=bw.nrd0(x)/2,  xmax=1, 
                                              bcmethod = "beta2", kernel = "gaussian"))


# Plotting LOO PIT ------

data <- ppc_data(bcpit, t(bcunif)) %>% 
  arrange(rep_id) %>% 
  mutate(xx = rep(xx, times=samples + 1))

ggplot(data) +
  aes_(x = ~ xx, y = ~ value) +
  geom_line(aes_(group = ~rep_id,  color = "yrep"),
            data = function(x) dplyr::filter(x, !.data$is_y),
            alpha = 0.7,
            size = 0.25,
            na.rm = TRUE) +
  geom_line(aes_(color = "y"),
            data = function(x) dplyr::filter(x, .data$is_y),
            size = 1,
            na.rm = TRUE) +
  scale_x_continuous(
    limits = c(0, 1),
    expand = expansion(0, 0),
    breaks = seq(from = .1, to = .9, by = .2)) +
  scale_y_continuous(
    limits = c(0, NA),
    expand = expansion(mult = c(0, .25))) +
  bayesplot_theme_get() +
  yaxis_title(FALSE) +
  xaxis_title(FALSE) +
  yaxis_text(FALSE) +
  yaxis_ticks(FALSE)

Aug 06 '20 15:08 ecoronado92

@avehtari Let me know if this looks like what you're looking for.

Yes! Now the average of yrep lines is constant, even if the variation is slightly bigger near edges (but then we'll have the same variation also for y, so this is what we want)

Aug 06 '20 15:08 avehtari

Cool! I'll dig into the method and will keep you updated

Aug 06 '20 15:08 ecoronado92

Awesome thanks!

Aug 06 '20 15:08 jgabry

@avehtari @jgabry I was able to build some simple methods I think could be added to the ppc-checks.R script.

I've provided an example after tinkering with my demo. https://htmlpreview.github.io/?https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/loo_pit.html

The methods code can be found here. I structured so that there are two main functions: one to apply the BC to the pit values and another to generate BC runif samples (as in the original code). One caveat right now is that it can be slow (e.g. it takes around 25 secs to generate the 100 BC runifs)

I've included a proposed ppc_loo_pit_overlay2() function to show how I would proceed include these in the function (see Boundary corrected PPC LOO PIT OVERLAY custom function section).

https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/bc_density.R

Let me know what you think

Aug 07 '20 23:08 ecoronado92

Cool, thanks for this! Sorry it took me a few days to get back to you. I think what you've done looks really good. What do you think @avehtari?

One caveat right now is that it can be slow (e.g. it takes around 25 secs to generate the 100 BC runifs)

Wow yeah that's slow. Presumably that's coming from the bc_pvals() function in the code you shared (and/or the bc_dunif() function it calls)? Do you know where the bottleneck is? A profiling tool like profvis might be helpful for figuring out which line of code is contributing the most to the run time. If we can isolate where the issue is maybe we can figure out a way to speed it up. If we can't (i.e., if the algorithm for boundary correction is just super slow by necessity) then I guess we can live with that (it still seems useful).

Aug 12 '20 19:08 jgabry

@jgabry

Do you know where the bottleneck is?

The bottleneck comes form the generating the BC runifs mostly because it has to two loops: one for each runif and another loop through each element of that runif. The length of these runifs depends on the sample size we have at hand (define by xs <- seq(0, 1, length.out = length(x)) , where x are the LOO PIT values).

So, for example, the for a sample with N=600 there are iter = 600 * 100 = 60,000 happening.

I've been playing around with some dplyr and reshape2 functions (just so it's consistent with the packages current dependencies) to see if I can speed this up without loops using case_when and melt, but I haven't been able to get it work quite yet.

I also looked into the package's rdbckde function and they are even slower since the random sampling is done via inverse CDF.

Aug 12 '20 19:08 ecoronado92

@avehtari @jgabry I hit a deadend using dplyr, tidyr and reshape2 without generating very large df with repeated data which can consume lots of memory and become messy.

Two potential solutions I can think of to speed things up with the current setup: 1) parallelize with foreach and doParallel , or 2) port the current functions to Rcpp Eigen. I'd think the second option is preferred since the package already suggests rstan which imports Rcpp

Aug 12 '20 21:08 ecoronado92

The bottleneck comes form the generating the BC runifs mostly because it has to two loops: one for each runif and another loop through each element of that runif. The length of these runifs depends on the sample size we have at hand (define by xs <- seq(0, 1, length.out = length(x)) , where x are the LOO PIT values).

So, for example, the for a sample with N=600 there are iter = 600 * 100 = 60,000 happening.

Thanks, that makes sense.

Two potential solutions I can think of to speed things up with the current setup: 1) parallelize with foreach and doParallel , or 2) port the current functions to Rcpp Eigen. I'd think the second option is preferred since the package already suggests rstan which imports Rcpp

Yeah right now I can't think of any other good options. But I think as a first step it makes sense to add an argument to the existing function that uses the slow boundary corrected version using just R code. We can leave the current version as the default and document that turning on boundary correction is preferable but slow. Once we have that we can then figure out the best way to speed it up and eventually if it's fast enough we can just make it the default (and probably remove the old way). But this way we at least have the option of using the boundary corrected version in the meantime while we figure out a better way.

What do you think?

Aug 12 '20 23:08 jgabry

Yeah right now I can't think of any other good options. But I think as a first step it makes sense to add an argument to the existing function that uses the slow boundary corrected version using just R code. We can leave the current version as the default and document that turning on boundary correction is preferable but slow. Once we have that we can then figure out the best way to speed it up and eventually if it's fast enough we can just make it the default (and probably remove the old way). But this way we at least have the option of using the boundary corrected version in the meantime while we figure out a better way.

What do you think?

@jgabry I think you're right and this approach is the least disruptive while adding this feature.

I'll create a branch and set up a pull request. Would you mind giving me permission to set up a branch and push changes there?

I'll set up the bc_dunif, bc_pvals and generate_bc_runifs functions under the # internal ---- section in ppc-loo.R to keep things tidy and emulate your current coding style.

Aug 13 '20 02:08 ecoronado92

I think you don't need that double loop, but you can just use directly those runif values for the density estimation for the reference lines.

Aug 13 '20 08:08 avehtari

I think you don't need that double loop, but you can just use directly those runif values for the density estimation for the reference lines.

@avehtari I think I see what you're saying, let me give it a try.

Aug 13 '20 12:08 ecoronado92

@avehtari You were right, I removed one loop and it now takes ~10 secs for 600 obs and ~20 secs for 1k obs. Still not optimal, but better (for context, the ~25 sec time I mentioned earlier was for 600 obs ).

I can't think of anything else top of mind on how to speed it further without C++ or parallelization.

The main bottleneck is in the vapply on the updated method in the link below (@jgabry I did some profiling and this seems to take ~70% of the overall compute time)

https://github.com/ecoronado92/towards_data_science/blob/master/hierarchical_models/bayes_lmm/loo_pit_example/bc_density.R

Aug 13 '20 15:08 ecoronado92

You were right, I removed one loop and it now takes ~10 secs for 600 obs

Good

I can't think of anything else top of mind on how to speed it further without C++ or parallelization.

Based on quick look of the code, I agree with this. Too bad that R is so slow. Anyway, great to have this even if it's slow, and we soon(ish) have something faster based on ecdfs.

Aug 13 '20 16:08 avehtari

I don't know whether this is the right place to ask but does bayesplot package have any plans to add Teemu's ECDF code here soon related to this graphical test paper?

Martin advised me to check with bayesplot regarding this issue in SBC package.

May 03 '21 10:05 hyunjimoon

I don't know whether this is the right place to ask but does bayesplot package have any plans to add Teemu's ECDF code here soon related to this graphical test paper?

Martin advised me to check with bayesplot regarding this issue in SBC package.

I created issue separately #267 for this.

May 06 '21 14:05 TeemuSailynoja

bayesplot bayesplot copied to clipboard

Adding LOO PIT vignette

bayesplot
bayesplot copied to clipboard