Senti4SD
Improve tool handling of very large input files
We need to rework our script to circumvent the fact that R, by default, loads an entire file into memory.
The easiest alternative is the ff library, which supports data frames with heterogeneous columns [1]; if the data were homogeneous (e.g., a numeric matrix), the bigmemory library would also do, but that does not appear to be our case.
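As a rough illustration of the ff approach, the sketch below reads a large CSV in fixed-size chunks instead of all at once. The file name, chunk sizes, and the processing loop body are assumptions for illustration, not part of the actual Senti4SD script.

```r
# Sketch (assumed file name and chunk sizes): memory-mapped CSV import with ff
library(ff)

big <- read.csv.ffdf(file = "dataset.csv",
                     header = TRUE,
                     first.rows = 10000,   # rows used to infer column classes
                     next.rows  = 50000)   # chunk size for subsequent reads

# Iterate over row ranges so only one chunk is materialized in RAM at a time
for (i in chunk(big)) {
  block <- big[i, ]   # this chunk is an ordinary in-memory data.frame
  # ... run the feature extraction on `block` here ...
}
```

The key point is that `read.csv.ffdf` keeps the data on disk, so the working set stays bounded by the chunk size rather than the file size.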
More general solutions are to use Hadoop and MapReduce to parallelize the complex task into smaller, faster subtasks [2], or to leverage a database for storing and then querying the data [3].
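For the database route, a minimal sketch using DBI with SQLite is shown below; the table name and batch size are hypothetical, and any SQL-backed store would work the same way.

```r
# Sketch (assumed table layout): stage data in SQLite, then fetch in batches
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "dataset.sqlite")

# One-off import; for a real workload the CSV would be loaded here instead
dbWriteTable(con, "documents",
             data.frame(id = 1:5, text = letters[1:5]),
             overwrite = TRUE)

# Stream results in fixed-size batches instead of fetching everything at once
res <- dbSendQuery(con, "SELECT id, text FROM documents")
while (!dbHasCompleted(res)) {
  batch <- dbFetch(res, n = 2)   # small batch size, just for the example
  # ... process `batch` here ...
}
dbClearResult(res)
dbDisconnect(con)
```

This keeps memory usage proportional to the batch size, and the database can also index and pre-filter the rows before they ever reach R.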
[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii