Decision: Use of `asyncmap` in pmap batch mode
pmap in batch mode uses a local asyncmap to process a batch - https://github.com/JuliaLang/julia/blob/9e3318c9840e7a9e387582ba861408cefe5a4f75/base/distributed/pmap.jl#L198
Considering that each computation in pmap is fairly large, and batch sizes small, an asyncmap would not have a major overhead and if the computation involves IO, quite beneficial.
For example, if the input is a list of file names to be processed, it is efficient to interleave I/O and computation and hence a local asyncmap is a better fit.
This issue is to take a decision whether to
-
Keep it as it is, i.e., no change - the batch is processed using
asyncmap -
Change it to a local map. If the computation involves I/O the caller would have to explicitly partition the input and the mapping function in turn would need to perform an
asyncmapand a flatten on the final output. -
Add another keyword arg to pmap,
batch_function=map. To useasyncmap, the caller would need to explicitly specifybatch_function=asyncmap. A user defined function (for example one that uses@threads) can also be specified.
@tanmaykm Was there an example of the asyncmap leading to a performance degradation? If we have a concrete example, that would be good to drive this.
I do like the idea of having a batch execution function inside of pmap, which could use threads.
Another way to do the async case may be to allow an @async pmap(...) syntax for the asynchronous case.
+1 for having the batch_function keyword, with maybe asyncmap as the default.
If there is performance degradation for compute bound jobs, I would rather have map be the default and @async pmap(...) the way to do the async case. It may also be the case that pmap has too much functionality loaded into it.
@async pmap(...) will not work as you intend it to - It will just execute the pmap in a new task.
I am for a new keyword arg, batch_function=map, i.e., with map as the default.
With a correct batchsize I don't think a local asyncmap will have any adverse performance impact.
I was imagining @async special casing on the pmap argument - but that is perhaps too ugly. batch_function is ok.
That was my original thinking. But now I realize it is difficult to predict how it will be used. for example folks are trying with batchsize of 1000.