tidybayes
make functions that skip data frames of draws and just give intervals
These would take a `point_interval` function (these act like combining `spread_samples` / `gather_samples` / `add_predicted_samples` / etc. with `point_interval`, but would be faster by avoiding the need to create the larger data frame):

- `[gather|spread]_intervals`? (not a great name)
- `[add|spread|gather]_predicted_intervals`, analogous to `rstanarm::posterior_predict`, `rethinking::sim`, `brms::predict`, and `modelr::[add|spread|gather]_predictions`
- `[add|spread|gather]_fitted_intervals`, analogous to `rethinking::link` and `brms::fitted`
Since an equivalent form already exists and these are just an efficiency thing, punting this till post-CRAN.
Is this still on the to-do list? I am working with a large (but not huge) brms model (a little under 1 GB). When I try to extract 1000 samples from it, I first run out of memory (on a 24-core Mac Pro with 64 GB RAM), and when I increase the memory, I run into an issue that traces back to a problem with tidyverse's vctrs (described here: https://github.com/r-lib/vctrs/issues/598). At the core of both problems is, I think, the amount of data associated with first getting the draws.
Context: the tibble `d` has 68,600 observations of 8 variables. I want to plot the prediction and the uncertainty therein as several lineribbons. Currently, I am doing that by first adding fitted draws:
```r
d %<>%
  add_fitted_draws(model = m, n = 1000, re_formula = NULL,
                   scale = "linear")
```
That's when I get the error described at https://github.com/r-lib/vctrs/issues/598:

```
Error: Internal error in `dict_hash_with()`: Dictionary is full.
```
Is there any way through tidybayes to obtain the relevant credible intervals directly, without first having to create the 1000 × 68,800 tibble? Or am I perhaps misunderstanding the issue, and it's the number of samples in the model (20,000) that matters, despite the fact that I only want 1000 random samples of those? (FWIW, the issue persists even when I draw just a single sample, n = 1.) Thank you in advance for any pointers you might have, and apologies for not posting a full example (I could link the model and data if that helps).
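For scale, a rough back-of-envelope estimate of that long-format draws table (the 11-column count is an assumption: the 8 original variables plus a few index/value columns such as `.row`, `.draw`, and `.value`; real tibble overhead will push this higher):

```r
rows = 68800 * 1000   # input rows x requested draws
cols = 8 + 3          # original variables + assumed index/value columns
rows * cols * 8 / 1024^3  # approx. GB at 8 bytes per double: ~5.6 GB
```

So even before any vctrs internals come into play, the intermediate table alone is in the multi-gigabyte range.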
It's definitely still on the list! It's more a question of when I might get to it. :)
To sketch a possible solution: basically it would be a function like `add_fitted_intervals()`, which would probably batch out calls to `add_fitted_draws()`, summarizing each batch down to intervals. E.g. instead of trying to run `add_fitted_draws()` on all 68,800 rows of `d` at once, do (say) 100 or 1000 rows of `d` at a time, calling `median_qi()` on each batch. Then you never have to hold a 68,800 × (number of draws in model) data frame in memory at once.
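A minimal sketch of what such a helper might look like (the name `add_fitted_intervals()` and its signature are hypothetical; this just wraps the batching idea around the existing `add_fitted_draws()` and `point_interval` functions):

```r
library(dplyr)
library(purrr)
library(tidybayes)

# Hypothetical helper: split newdata into batches, get fitted draws for each
# batch, and immediately collapse each batch to intervals so the full
# rows-x-draws table is never held in memory at once.
# `.point_interval` can be median_qi, mean_qi, mode_hdi, etc.
add_fitted_intervals = function(newdata, model, batch_size = 1000,
                                .point_interval = median_qi, ...) {
  newdata %>%
    split((seq_len(nrow(.)) - 1) %/% batch_size) %>%
    map_dfr(function(batch) {
      batch %>%
        add_fitted_draws(model, ...) %>%
        .point_interval()
    })
}
```

Usage would then be something like `d %>% add_fitted_intervals(m, n = 1000, re_formula = NULL, scale = "linear")`, with extra arguments passed through to `add_fitted_draws()`.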
That would be great. Currently, this limitation keeps me from using tidybayes for larger datasets (this one isn't even that large -- it's a pretty common size for experiments in the psychological and social sciences). Thank you for considering it =).
The more I think about this, the more I think a more generic batch + map function would be sufficiently flexible without having to make an interval-generating function for every existing function.
E.g., consider this model:
```r
library(tidyverse)
library(tidybayes)
library(brms)

m = brm(mpg ~ wt, data = mtcars, family = lognormal)
```
If normally you would do this:
```r
mtcars %>%
  expand(wt = seq(min(wt), max(wt), length.out = 1000)) %>%
  add_fitted_draws(m) %>%
  median_qi()
```
You just (I think) need to wrap the `add_fitted_draws()` and `median_qi()` chunk in something so that you only create the draws table for some small number of input rows at a time.
This is one quick attempt at "batching" the calls so that only 100 rows of the input table are used at a time:
```r
batch_size = 100

mtcars %>%
  expand(wt = seq(min(wt), max(wt), length.out = 1000)) %>%
  split(1:nrow(.) %/% batch_size) %>%
  map_dfr(. %>%
    add_fitted_draws(m) %>%
    median_qi()
  )
```
Thank you. For reasons specific to my project, this doesn't directly solve my problem, but it is a neat solution! The slicing seems to help even if one doesn't summarize the data within each slice (and only applies `add_fitted_draws()`).
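For the record, the batching-without-summarizing variant of Matthew's example above would look like this (a sketch: note that because the batches are only row-bound back together, this still materializes the full draws table at the end, so it only reduces peak memory during the individual `add_fitted_draws()` calls):

```r
batch_size = 100

draws = mtcars %>%
  expand(wt = seq(min(wt), max(wt), length.out = 1000)) %>%
  split(1:nrow(.) %/% batch_size) %>%
  map_dfr(. %>% add_fitted_draws(m))
```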