Decision: Use of `asyncmap` in pmap batch mode

Open amitmurthy opened this issue 8 years ago • 8 comments

In batch mode, pmap uses a local asyncmap to process each batch: https://github.com/JuliaLang/julia/blob/9e3318c9840e7a9e387582ba861408cefe5a4f75/base/distributed/pmap.jl#L198

Given that each computation in pmap is typically fairly large and batch sizes are small, an asyncmap should not add major overhead, and if the computation involves I/O it is quite beneficial.

For example, if the input is a list of file names to be processed, it is efficient to interleave I/O and computation and hence a local asyncmap is a better fit.
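A minimal sketch of that interleaving effect, assuming a hypothetical `fake_read` function in which `sleep` stands in for file I/O (the file names are made up, too). With plain `map` the waits run one after another; with `asyncmap` each item gets its own task, so the waits overlap:

```julia
# Hypothetical stand-in for "open a file and process it":
# sleep simulates the I/O wait, length is the (trivial) computation.
fake_read(name) = (sleep(0.2); length(name))

names = ["a.txt", "bb.txt", "ccc.txt", "dddd.txt", "eeeee.txt", "ffffff.txt"]

# Serial: the six waits add up.
t_map = @elapsed r_map = map(fake_read, names)

# Concurrent tasks on the same process: the six waits overlap.
t_async = @elapsed r_async = asyncmap(fake_read, names)

println("map: ", round(t_map; digits=2), " s, asyncmap: ", round(t_async; digits=2), " s")
```

Both calls return the same values; only the elapsed time differs. For purely compute-bound functions there is no wait to overlap, so this gain would not be expected.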

This issue is to decide whether to:

  1. Keep it as it is, i.e., no change - the batch is processed using asyncmap

  2. Change it to a local map. If the computation involves I/O the caller would have to explicitly partition the input and the mapping function in turn would need to perform an asyncmap and a flatten on the final output.

  3. Add another keyword arg to pmap, batch_function=map. To use asyncmap, the caller would need to explicitly specify batch_function=asyncmap. A user defined function (for example one that uses @threads ) can also be specified.
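To make options 2 and 3 concrete, here is a rough sketch of what the caller-side code could look like today. `batched_pmap` is a hypothetical helper, not part of Distributed, and its `batch_function` keyword only emulates the proposed one: it partitions the input explicitly, lets pmap hand out whole batches, applies `batch_function` to each batch locally, and flattens the result. `square_slowly` is likewise a made-up stand-in for real work.

```julia
using Distributed

# Hypothetical helper (not part of Distributed): emulates the proposed
# `batch_function` keyword on top of the existing pmap.
function batched_pmap(f, xs; batchsize::Integer=4, batch_function=map)
    # Explicit partitioning by the caller, as in option 2.
    batches = collect(Iterators.partition(xs, batchsize))
    # Each pmap call processes one whole batch with `batch_function`.
    per_batch = pmap(batch -> batch_function(f, batch), batches)
    # Flatten the per-batch results back into a single vector.
    return reduce(vcat, per_batch)
end

# Made-up workload: sleep stands in for I/O, squaring for computation.
square_slowly(x) = (sleep(0.01); x^2)

r_map   = batched_pmap(square_slowly, 1:10; batchsize=3)
r_async = batched_pmap(square_slowly, 1:10; batchsize=3, batch_function=asyncmap)
```

This runs as written on a single process. With worker processes added, `square_slowly` would need to be defined on the workers as well (for example via `@everywhere`). The sketch also assumes a non-empty input, since `reduce(vcat, ...)` fails on an empty collection.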

amitmurthy, May 26 '17 05:05

@tanmaykm Was there an example of the asyncmap leading to a performance degradation? If we have a concrete example, that would be good to drive this.

I do like the idea of having a batch execution function inside of pmap, which could use threads.

ViralBShah, May 29 '17 05:05

Another way to do the async case may be to allow an @async pmap(...) syntax for the asynchronous case.

ViralBShah, May 29 '17 05:05

+1 for having the batch_function keyword, with maybe asyncmap as the default.

tanmaykm, May 29 '17 05:05

If there is performance degradation for compute-bound jobs, I would rather have map be the default and @async pmap(...) the way to do the async case. It may also be the case that pmap has too much functionality loaded into it.

ViralBShah, May 29 '17 05:05

@async pmap(...) will not work as you intend it to. It will just execute the whole pmap call in a new task; it does not change how each batch is processed.
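A small sketch of what that looks like in practice (the mapped function and input range are arbitrary): `@async` wraps the entire pmap call in a single `Task`, and the caller has to `fetch` it to get the result.

```julia
using Distributed

# @async schedules the whole pmap call as one task on the current process.
t = @async pmap(x -> x + 1, 1:4)

println(typeof(t))   # a Task, not the mapped results

# The results only become available by waiting on the task.
result = fetch(t)
```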

I am for a new keyword arg, batch_function=map, i.e., with map as the default.

amitmurthy, May 29 '17 05:05

With a correct batchsize I don't think a local asyncmap will have any adverse performance impact.

tanmaykm, May 29 '17 05:05

I was imagining @async special casing on the pmap argument - but that is perhaps too ugly. batch_function is ok.

ViralBShah, May 29 '17 05:05

That was my original thinking. But now I realize it is difficult to predict how it will be used. For example, folks are trying a batchsize of 1000.

amitmurthy, May 29 '17 06:05