miniparquet

Support memory = FALSE like in sparklyr

Open randomgambit opened this issue 6 years ago • 4 comments

Hi,

Assuming it is technically possible, it would be useful to have the data indexed but not yet loaded into RAM, as in sparklyr (see https://www.rdocumentation.org/packages/sparklyr/versions/1.0.2/topics/spark_read_parquet).

That would allow the user to open very large parquet files but pay only for what is actually used, similar to what vroom does (https://github.com/r-lib/vroom).

What do you think? Thanks!

randomgambit, Sep 19 '19 17:09

Yes, I plan to implement ALTREP features for the parquet reader as well, similar to vroom.

hannes, Sep 20 '19 08:09

Great idea! Maybe you should work with Jim Hester (@jimhester, vroom author) to get a single package that handles CSV and parquet super fast? That would be a killer package in my opinion. More developers are also needed to fix bugs and other inefficiencies. What do you think?

randomgambit, Sep 20 '19 11:09

Check out the altrep branch in this repo... for now, it materialises everything at once, but things like this should no longer read any unrelated payload data:

a <- miniparquet::read_parquet("...")
names(a)     # column names only
mean(a$col)  # accesses just the column `col`
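
For readers unfamiliar with the idea, here is a rough base R sketch of "pay only for what is actually used". It uses promises created with delayedAssign rather than ALTREP (which needs C code), so it is an analogy only and not how the altrep branch works. The names read_column, add_lazy_column and lazy_table are hypothetical helpers made up for this illustration.

```r
# Illustration only: lazy columns via base R promises (delayedAssign).
# This is NOT ALTREP and NOT miniparquet's implementation.

reads <- character(0)  # log of which columns were actually "read"

# Hypothetical stand-in for an expensive per-column read from disk.
read_column <- function(name) {
  reads <<- c(reads, name)
  seq_len(5) * nchar(name)  # dummy payload
}

# Register one column as a promise; nothing is read until it is accessed.
add_lazy_column <- function(env, name) {
  delayedAssign(name, read_column(name),
                eval.env = environment(), assign.env = env)
}

# Build an environment whose entries are unevaluated column promises.
lazy_table <- function(col_names) {
  env <- new.env()
  for (nm in col_names) add_lazy_column(env, nm)
  env
}

a <- lazy_table(c("col", "other"))
ls(a)        # lists the names without forcing any column
mean(a$col)  # forces only `col`
reads        # shows which columns were read
```

In this sketch, listing the names triggers no read at all, and taking the mean of one column leaves the other untouched, which is the behaviour the snippet above is aiming for.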

hannes, Sep 23 '19 10:09

See also https://twitter.com/hfmuehleisen/status/1176410678967640065?s=20

hannes, Sep 24 '19 08:09