miniparquet: support memory = FALSE like in sparklyr

Hi,

Assuming it is technically possible, it would be useful to have the data indexed but not yet loaded into RAM, as in sparklyr (see https://www.rdocumentation.org/packages/sparklyr/versions/1.0.2/topics/spark_read_parquet).

That would allow the user to open very large parquet files but pay only for what is actually used, similar to what vroom does (https://github.com/r-lib/vroom).

What do you think? Thanks!

Sep 19 '19 17:09 randomgambit

Yes, I plan to implement ALTREP features for the parquet reader as well, similar to vroom.

Sep 20 '19 08:09 hannes

Great idea! Maybe you should work with Jim Hester (@jimhester, vroom author) to get a single package that handles CSV + parquet super fast? That would be a killer package in my opinion! And more devs would help to fix bugs and other inefficiencies. What do you think?

Sep 20 '19 11:09 randomgambit

Check out the altrep branch in this repo... for now, it materialises everything at once, but things like this should no longer read any unrelated payload data:

a <- miniparquet::read_parquet("...")
names(a)
mean(a$col)
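
For illustration only, the "pay only for what is used" idea can be sketched in plain R. This is not the ALTREP code in the branch: it swaps in base R promises (delayedAssign) as a simpler stand-in, and lazy_table, make_lazy_column and fake_reader are hypothetical names made up for this example, not part of miniparquet.

```r
# Hypothetical sketch: per-column lazy loading with base R promises.
# This is NOT how miniparquet or ALTREP work internally; it only shows
# the access pattern where untouched columns are never read.

# Register one column as a promise: read_column(name) runs on first access.
make_lazy_column <- function(env, name, read_column) {
  force(name)
  force(read_column)
  delayedAssign(name, read_column(name), assign.env = env)
}

# Build an environment holding one unevaluated promise per column.
lazy_table <- function(col_names, read_column) {
  env <- new.env()
  for (nm in col_names) {
    make_lazy_column(env, nm, read_column)
  }
  env
}

# Stand-in reader (an assumption): records which columns were actually read.
loaded <- character(0)
fake_reader <- function(name) {
  loaded <<- c(loaded, name)
  seq_len(5)
}

tbl <- lazy_table(c("a", "b"), fake_reader)
col_names <- ls(tbl)     # listing the names does not read any column
result <- mean(tbl$a)    # forces only column "a"
print(loaded)            # column "b" has not been read at this point
```

A real reader would replace fake_reader with something that decodes a single column chunk from the file. ALTREP goes further than this sketch because the lazy object still looks like an ordinary vector inside a data frame.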

Sep 23 '19 10:09 hannes

See also https://twitter.com/hfmuehleisen/status/1176410678967640065?s=20

Sep 24 '19 08:09 hannes