
Solution for a large number of files?

Open kendonB opened this issue 8 years ago • 4 comments

Hi, I'm new to remake.

I haven't worked out a nice way to deal with a large number of files in remake. I have an application where a function or script reads a large number of files (think daily data with a file per day), which I'd like to track in remake, and outputs a smaller but still large number of files (think monthly aggregates of some subset of the daily data).

As far as I can tell, in the current version, I would have to make individual targets for each raw data download, and only have a single target for each of the monthly aggregations.

This cries out for a loop, but I haven't seen how to do that in remake.

Thoughts? Thanks for any help.

kendonB avatar Apr 07 '17 03:04 kendonB

Good question. I think this relates to #2. There are some great templating ideas on that thread. For your use case, you might use the yaml package to write each day's remake.yml file, though I suppose keeping it up to date could be a pain. The proposed solution by @krlmlr on Nov 25, 2016 seems ideal for you, but I do not think that functionality has been implemented yet.

Last summer, I wrote remakeGenerator to programmatically generate remake.yml files from data frames of commands. It has some functions to manipulate these data frames (analyses(), summaries(), expand(), evaluate(), gather()) so you do not have to write everything by hand.
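To make the "generate remake.yml programmatically" idea concrete, here is a minimal sketch in Python rather than R. It is not remakeGenerator's API. The directory layout (`daily/`, `monthly/`), the source file `aggregate.R`, and the function `aggregate_month()` are hypothetical names chosen for illustration, and the YAML layout assumes remake's usual `sources` / `targets` / `command` / `depends` keys and the special `target_name` argument.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_files(start, end):
    """Return hypothetical daily file paths for each day in [start, end]."""
    n_days = (end - start).days + 1
    return [
        f"daily/{(start + timedelta(days=i)).isoformat()}.csv"
        for i in range(n_days)
    ]


def build_remake_yaml(start, end):
    """Emit remake.yml text with one file target per monthly aggregate."""
    # Group the daily files by their "YYYY-MM" prefix.
    by_month = defaultdict(list)
    prefix = len("daily/")
    for path in daily_files(start, end):
        by_month[path[prefix:prefix + 7]].append(path)

    # Assumed layout: an "all" target that depends on every monthly file.
    lines = ["sources:", "  - aggregate.R", "", "targets:", "  all:", "    depends:"]
    for month in sorted(by_month):
        lines.append(f"      - monthly/{month}.csv")

    # One target per month, listing its daily inputs as dependencies.
    for month in sorted(by_month):
        lines.append(f"  monthly/{month}.csv:")
        lines.append("    command: aggregate_month(target_name)")  # hypothetical R function
        lines.append("    depends:")
        for path in by_month[month]:
            lines.append(f"      - {path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_remake_yaml(date(2017, 1, 30), date(2017, 2, 2)))
```

Rerunning a script like this whenever new daily files arrive would regenerate the whole file, which avoids editing hundreds of targets by hand.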

wlandau avatar Apr 07 '17 04:04 wlandau

@wlandau Do you have thoughts on using remake + remakeGenerator vs Drake for these sorts of biggish data projects? I'm running things which use memory up to around 180GB and total disk usage of around 300-400GB.

kendonB avatar Apr 07 '17 04:04 kendonB

@kendonB After my company lets me release the latest drake patch (which fixes issues 16, 17, 18, and 19), I am not sure how the two options will stack up in terms of performance. I do know that remake has been around longer and has stood the test of far more projects, and I have not tested drake on files that large. I could compare remake and drake in more detail, but I do not think this is the place for that. However, I do think this is a good opportunity for some benchmarking. Is your project public? If you use remake and remakeGenerator, maybe I could port it to drake and compare.

wlandau avatar Apr 07 '17 05:04 wlandau

@kendonB I thought about your use case a little more, and I think there may be more to say.

  • Speed: I have not done much benchmarking to compare drake to remake, so I cannot really speak to this yet (except that drake issue 18 will be patched relatively soon).
  • Storage: drake and remake both use storr to maintain the cache. For each file target, both packages track the file's fingerprint rather than the file itself. My intuition says that both should have about the same storage efficiency.
  • Memory: I am not sure how remake manages objects in memory (though #156 improves this). As for drake, I try to conserve memory using envir.R. Before each parallelizable stage of targets, drake loads the targets it needs and unloads the targets it will never need again. During each stage, newly-made targets are stored in memory in case they will be needed to make future targets, a decision that prioritizes speed over memory consumption.
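The load/unload strategy in the memory bullet can be sketched roughly as follows. This is an illustration in Python of the general idea only, not drake's actual code: the function `plan_memory`, the `stages` list, and the `deps` mapping are all hypothetical names.

```python
def plan_memory(stages, deps):
    """For each stage, list which targets to load before it and which to unload.

    stages: list of lists of target names, in build order.
    deps:   dict mapping a target name to the names it depends on.
    """
    # last_use[t] = index of the last stage containing a target that needs t.
    last_use = {}
    for i, stage in enumerate(stages):
        for target in stage:
            for dep in deps.get(target, []):
                last_use[dep] = i

    in_memory = set()
    plan = []
    for i, stage in enumerate(stages):
        needed = {dep for target in stage for dep in deps.get(target, [])}
        # Drop anything that no current or future stage will need.
        unload = {t for t in in_memory if last_use.get(t, -1) < i}
        in_memory -= unload
        # Load only the dependencies that are not already held in memory.
        load = needed - in_memory
        in_memory |= load
        plan.append({"load": sorted(load), "unload": sorted(unload)})
        # Newly made targets stay in memory in case later stages need them.
        in_memory |= set(stage)
    return plan


if __name__ == "__main__":
    stages = [["a", "b"], ["c"], ["d"]]
    deps = {"a": ["raw"], "c": ["a"], "d": ["c"]}
    for i, step in enumerate(plan_memory(stages, deps)):
        print(i, step)
```

Keeping newly made targets in memory until their last use is the speed-over-memory trade-off described above; a more memory-frugal variant would drop them right away and reload them from the cache when needed.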

wlandau avatar Apr 07 '17 13:04 wlandau