
Investigate keeping processes around

Open gtoonstra opened this issue 9 years ago • 0 comments

The map/reduce examples have clear boundaries between startup, reading data, processing data and writing it out to disk. The process lifetime doesn't extend beyond those boundaries, so every job pays the cost of disk I/O again.

As in Apache Spark, keeping data in memory avoids disk access, which can improve performance. It is important to realize that the processing boundaries themselves stay the same whether the data sits on disk or in memory. The only difference is that at the moment where the mapper (for example) would write a partition to disk and exit, it would instead stay around and wait for queries to be executed against the data in its partitions.
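A rough sketch of what "staying around" could look like. This is not remap code: `ResidentMapper`, `map` and `serve` are hypothetical names, and an in-process `queue.Queue` stands in for whatever transport would actually deliver queries.

```python
import queue
from collections import defaultdict


class ResidentMapper:
    """Hypothetical mapper that keeps its partitions in memory instead of exiting."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition id -> list of (key, value)

    def map(self, records):
        # Same boundary as today: read the input and partition it by key.
        for key, value in records:
            self.partitions[hash(key) % self.num_partitions].append((key, value))

    def serve(self, requests, responses):
        # Where the process used to write partitions to disk and exit,
        # it now waits for queries. A None request shuts it down.
        while True:
            request = requests.get()
            if request is None:
                break
            partition_id, func = request
            responses.put(func(self.partitions[partition_id]))
```

A query here is just a `(partition_id, func)` pair, so the caller decides what to compute and the mapper only supplies the data it already holds.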

What's left is to figure out how to express the functions to be executed against the data (which may be in any format) in a consistent way. Most of them are aggregation functions:

  • sum
  • group?
  • etc
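One possible way to express these consistently, as a sketch only: every aggregation offers a `partial` step that runs inside the process holding a partition, and a `merge` step that combines the partial results. The names `Sum`, `GroupBy` and `run` are made up for illustration, and the accessor callables (`value_of`, `key_of`) are an assumption for coping with data in any format.

```python
from collections import defaultdict


class Sum:
    def __init__(self, value_of):
        self.value_of = value_of  # extracts a number from a row, whatever its format

    def partial(self, rows):
        return sum(self.value_of(row) for row in rows)

    def merge(self, partials):
        return sum(partials)


class GroupBy:
    def __init__(self, key_of, inner):
        self.key_of = key_of  # extracts the grouping key from a row
        self.inner = inner    # aggregation applied within each group

    def partial(self, rows):
        groups = defaultdict(list)
        for row in rows:
            groups[self.key_of(row)].append(row)
        return {key: self.inner.partial(group) for key, group in groups.items()}

    def merge(self, partials):
        collected = defaultdict(list)
        for part in partials:
            for key, value in part.items():
                collected[key].append(value)
        return {key: self.inner.merge(values) for key, values in collected.items()}


def run(aggregation, partitions):
    # In a real setup each partial() would run in the process that owns the partition.
    return aggregation.merge([aggregation.partial(rows) for rows in partitions])
```

With rows as dicts, `run(GroupBy(lambda r: r["city"], Sum(lambda r: r["n"])), partitions)` would give a per-city total without the framework knowing anything about the row format.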

Joins are a lot harder to achieve. Maybe the mapper/reducer process itself can implement specific functions that dictate how this is done, so that the framework doesn't become overly generic and hard to read.
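To make that concrete, a sketch under the assumption that the framework only calls a `join` hook and leaves the strategy to the process. `Worker`, `OrdersCustomersWorker` and the field names are hypothetical; the hash join is just one strategy a process might pick.

```python
from collections import defaultdict


class Worker:
    """Hypothetical base class: the framework calls join() but does not define it."""

    def join(self, left_rows, right_rows):
        raise NotImplementedError


class OrdersCustomersWorker(Worker):
    def join(self, left_rows, right_rows):
        # Hash join: index the right side (assumed to fit in memory) by customer id...
        by_id = defaultdict(list)
        for customer in right_rows:
            by_id[customer["id"]].append(customer)
        # ...then probe the index once per order. Orders without a match are dropped.
        return [
            {**order, "customer_name": customer["name"]}
            for order in left_rows
            for customer in by_id.get(order["customer_id"], [])
        ]
```

The framework stays small this way, at the cost of each job carrying its own join logic.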

gtoonstra, Jul 04 '15 12:07