
Improve tool handling of very large input files

Open bateman opened this issue 5 years ago • 2 comments

We need to re-code our script to work around the fact that, by default, R tries to load an entire file into memory. The easiest alternative is to use the ff library, which works with data frames containing heterogeneous data [1]; if the data were homogeneous (e.g., a numeric matrix), the bigmemory library would also do, but that does not appear to be our case. More general solutions are to use Hadoop and map-reduce to split a large task into smaller, faster subtasks [2], or to store the data in a database and query it as needed [3]. A rough sketch of the ff-based approach is below.
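
A minimal sketch of chunked reading and processing with ff. This is not the actual Senti4SD script: the file name `input.csv`, the chunk sizes, and the processing step are placeholders.

```r
# Sketch only: "input.csv", chunk sizes, and the per-block processing
# step are assumptions, not the real Senti4SD pipeline.
library(ff)

# read.csv.ffdf keeps the table on disk (memory-mapped) and reads it in
# chunks, so the whole file is never loaded into RAM at once.
big_df <- read.csv.ffdf(file = "input.csv",
                        header = TRUE,
                        first.rows = 10000,  # rows used to infer column types
                        next.rows  = 50000)  # rows read per subsequent chunk

# Process the on-disk ffdf in fixed-size row blocks so peak memory stays bounded.
chunk_size <- 50000
n <- nrow(big_df)
for (start in seq(1, n, by = chunk_size)) {
  end   <- min(start + chunk_size - 1, n)
  block <- big_df[start:end, ]  # only this block is materialised as a data.frame
  # ... run feature extraction / classification on 'block' ...
}
```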

[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii

bateman · Dec 04 '18 09:12