
make functions that skip data frames of draws and just give intervals

Open mjskay opened this issue 7 years ago • 5 comments

These would take a point_interval function. They would act like combining spread_samples / gather_samples / add_predicted_samples / etc. with point_interval, but would be faster by avoiding the need to create the larger data frame of draws:

  • [gather|spread]_intervals ? (not a great name)
  • [add|spread|gather]_predicted_intervals
    • analogous to rstanarm::posterior_predict, rethinking::sim, brms::predict, modelr::[add|spread|gather]_predictions
  • [add|spread|gather]_fitted_intervals
    • analogous to rethinking::link, brms::fitted
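A rough sketch of what one of these interfaces might look like (the function name, arguments, and defaults here are all hypothetical, mirroring the existing add_fitted_draws() signature rather than any implemented API):

```r
library(tidybayes)

# Hypothetical interface sketch: like add_fitted_draws(), but returning
# intervals directly instead of a data frame of draws. Internally it could
# summarize draws batch-by-batch so the full draws table never exists.
add_fitted_intervals = function(newdata, model,
                                point_interval = median_qi,
                                .width = 0.95, ...) {
  # (implementation would batch add_fitted_draws() calls; see below)
}
```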

Since an equivalent form already exists and these are just an efficiency thing, punting this till post-CRAN.

mjskay avatar Aug 15 '17 04:08 mjskay

Is this still on the to-do list? I am working with a large (but not huge) brms model (a little under 1 GB). When I try to extract 1000 samples from it, I first run out of memory (on a 24-core Mac Pro with 64 GB RAM), and when I increase the memory, I run into an issue that traces back to a problem with tidyverse's vctrs (described here: https://github.com/r-lib/vctrs/issues/598). At the core of both problems is, I think, the amount of data associated with first getting the draws.

Context: The tibble d has 68,600 observations of 8 variables. I want to plot the predictions and the uncertainty around them as several lineribbons. Currently, I am doing that by first adding fitted draws:

d %<>%
    add_fitted_draws(model = m, n = 1000, re_formula = NULL,
                     scale = "linear")

That's when I get the error described at https://github.com/r-lib/vctrs/issues/598:

r: Internal error in `dict_hash_with()`: Dictionary is full.

Is there any way through tidybayes to obtain the relevant credible intervals directly, without first having to create the 1000 × 68,800 tibble? Or am I perhaps misunderstanding the issue, and it's the number of samples in the model (20,000) that matters, despite the fact that I only want 1000 random samples of those? (FWIW, the issue persists even when I draw just a single sample, n = 1.) Thank you in advance for any pointers you might have, and apologies for not posting a full example (I could link the model and data if that helps).

tfjaeger avatar May 29 '21 02:05 tfjaeger

It's definitely still on the list! It's more a question of when I might get to it. :)

To sketch a possible solution: basically it would be a function like add_fitted_intervals(), which would batch out calls to add_fitted_draws(), summarizing each batch down to intervals. E.g., instead of trying to do add_fitted_draws() on all 68,800 rows of d at once, do (say) 100 or 1000 rows of d at a time, calling median_qi() on each batch. Then you never have to hold a 68,800 × (number of draws in model) data frame in memory at once.
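The batching idea described above could be wrapped into a helper roughly like this (a hypothetical sketch, not an implemented API; the function name and batch_size default are illustrative):

```r
library(dplyr)
library(purrr)
library(tidybayes)

# Hypothetical: split newdata into chunks of batch_size rows, run
# add_fitted_draws() on each chunk, and immediately collapse that chunk
# to intervals with median_qi(). Only one chunk's draws exist in memory
# at a time; map_dfr() binds the per-chunk interval summaries together.
add_fitted_intervals = function(newdata, model, batch_size = 100, ...) {
  newdata %>%
    split((seq_len(nrow(.)) - 1) %/% batch_size) %>%
    map_dfr(~ .x %>% add_fitted_draws(model, ...) %>% median_qi())
}
```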

mjskay avatar Jun 01 '21 18:06 mjskay

That would be great. Currently, this limitation keeps me from using tidybayes for larger datasets (this one isn't even that​ large -- it's a pretty common size for experiments in the psychological and social sciences). Thank you for considering it =).

tfjaeger avatar Jun 01 '21 21:06 tfjaeger

The more I think about this, the more I think a more generic batch + map function would be sufficiently flexible without having to make an interval-generating function for every existing function.

E.g., consider this model:

library(tidyverse)
library(tidybayes)
library(brms)

m = brm(mpg ~ wt, data = mtcars, family = lognormal)

If normally you would do this:

mtcars %>% 
  expand(wt = seq(min(wt), max(wt), length.out = 1000)) %>% 
  add_fitted_draws(m) %>% 
  median_qi()

You just (I think) need to wrap the add_fitted_draws() and median_qi() chunk in something so that you only create the draws table for some small number of input rows at a time.

This is one quick attempt at "batching" the calls so that only 100 rows of the input table are used at a time:

batch_size = 100

mtcars %>% 
  expand(wt = seq(min(wt), max(wt), length.out = 1000)) %>% 
  split(1:nrow(.) %/% batch_size) %>%
  map_dfr(. %>% 
    add_fitted_draws(m) %>% 
    median_qi()
  )

mjskay avatar Jun 02 '21 19:06 mjskay

Thank you. For reasons specific to my project, this doesn't directly solve my problem, but it is a neat solution! The slicing seems to help even if one doesn't summarize the data within each slice (but only applies add_fitted_draws()).
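For reference, a sketch of that variant (assuming the model m and batch_size from the earlier example): batching the calls as before, but binding the raw fitted draws back together instead of collapsing each batch to intervals. This limits the peak memory used while draws are being generated, even though the final result is still the full draws table.

```r
library(tidyverse)
library(tidybayes)

batch_size = 100

# Same batching pattern as above, but map_dfr() binds the raw per-batch
# draws together rather than summarizing each batch with median_qi().
draws = mtcars %>%
  expand(wt = seq(min(wt), max(wt), length.out = 1000)) %>%
  split(1:nrow(.) %/% batch_size) %>%
  map_dfr(. %>% add_fitted_draws(m))
```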

tfjaeger avatar Jun 03 '21 17:06 tfjaeger