posterior Arrow support

Arrow support

Open • avehtari opened this issue 3 years ago • 1 comment


When a very large number of variables is stored in CSV, it could be useful to read the posterior CSV as an Arrow data table and use that to let the user select which variables are actually read into memory (this could also be used for thinning). arrow_table supports dplyr, so implementing selection and filtering would be relatively easy. It might be easier to allow this only when first reading the draws from CSV, as adding yet another draws type (e.g. draws_arrow_table) would be more work.

The Arrow R cheatsheet shows an example of using dplyr: https://github.com/apache/arrow/blob/master/r/cheatsheet/arrow-cheatsheet.pdf The cheatsheet talks about larger-than-memory data, but I assume it could be faster to avoid reading a whole big CSV into memory even if it would fit.
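A minimal sketch of selecting variables at read time, assuming the arrow and posterior R packages are installed. The file is a made-up stand-in for a Stan CSV without comment lines, and the variable names (mu, tau, eta) are hypothetical. As far as I understand, Arrow still scans the file, but only the selected columns are converted and kept in memory.

```r
library(arrow)

# Toy stand-in for a posterior CSV (values are made up, no Stan comment lines).
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    lp__ = c(-1.2, -0.8, -1.5, -0.9),
    mu   = c(0.1, 0.3, 0.2, 0.4),
    tau  = c(1.1, 0.9, 1.3, 1.0),
    eta  = c(-0.5, 0.2, 0.0, 0.7)
  ),
  path,
  row.names = FALSE
)

# Keep only the requested variables and return an Arrow Table,
# not an R data frame.
tab <- read_csv_arrow(path, col_select = c("mu", "tau"), as_data_frame = FALSE)

# Convert to a draws object only once the selection has been made.
draws <- posterior::as_draws_df(as.data.frame(tab))

# Thinning is done here after conversion, via posterior's existing helper.
thinned <- posterior::thin_draws(draws, thin = 2)
```

In this sketch the thinning happens after the selected columns are already in memory, so it only illustrates the variable selection part of the proposal.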

I hope Stan's special comments in the CSV files do not make this impossible.
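One possible workaround, sketched under the assumption that the comment lines all start with `#` and that Arrow's CSV reader does not skip them on its own: filter the comment lines out first, then hand the cleaned file to Arrow. The file contents below are made up and only imitate the layout of a Stan CSV (comments before the header, after it, and at the end).

```r
library(arrow)

# Toy file imitating a Stan CSV layout; all contents are made up.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# model = example_model",
  "lp__,mu,tau",
  "# Adaptation terminated",
  "-1.2,0.1,1.1",
  "-0.8,0.3,0.9",
  "# Elapsed Time: 0.1 seconds (Total)"
), path)

# Drop every line that starts with '#', then write the rest to a clean file.
lines <- readLines(path)
clean <- tempfile(fileext = ".csv")
writeLines(lines[!startsWith(lines, "#")], clean)

# The cleaned file is plain CSV, so Arrow can read a subset of its columns.
tab <- read_csv_arrow(clean, col_select = c("mu", "tau"), as_data_frame = FALSE)
```

Note that `readLines()` pulls the whole file into memory as text, so for really large files the filtering step would need to be streamed (for example with an external tool) for this to help.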

May 17 '22 07:05 avehtari

Interesting. If I am reading this correctly, arrow supports dplyr's database-style interface, right? That is, mutate() / select() / etc. construct queries that are not executed until collect() is called. If so, perhaps a more generic solution would be a draws format that works like draws_df but can use any database backend supported by dplyr (including arrow).
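A small sketch of that deferred behaviour, assuming the arrow and dplyr packages are installed. The data and variable names are made up, and this only shows the lazy query, not a draws format built on top of it.

```r
library(arrow)
library(dplyr)

# Toy CSV with made-up values.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    mu  = c(0.1, 0.3, 0.2, 0.4),
    tau = c(1.1, 0.9, 1.3, 1.0),
    eta = c(-0.5, 0.2, 0.0, 0.7)
  ),
  path,
  row.names = FALSE
)

# Open the file as an Arrow dataset; no rows are loaded into R here.
ds <- open_dataset(path, format = "csv")

# select() and filter() only build up a query description.
query <- ds %>%
  select(mu, tau) %>%
  filter(tau > 1)

# The query is not a data frame yet.
lazy <- !is.data.frame(query)

# collect() runs the query and returns the result as a data frame.
res <- collect(query)
```

A dplyr-backed draws format would presumably hold something like `query` internally and call `collect()` only when summaries or conversions need the actual values.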

May 17 '22 22:05 mjskay
