The ability to control the concurrency of parallelize

Open Sharpie opened this issue 4 years ago • 8 comments

Use Case

Given a large list of inputs that must be processed on a remote node, the parallelize() function could be used with run_task to effect parallel processing:

$large_list.parallelize |$item| {
  run_task('some_module::io_intensive_analysis', $target_with_fancy_tools, {input => $item})
}

However, if the processing step is resource intensive, there needs to be a way to control how many tasks are dispatched in parallel. This could be done by wrapping the call to parallelize in slice to chunk the inputs up:

$large_list.slice(10).each |$chunk| {
  $chunk.parallelize |$item| {
    run_task('some_module::io_intensive_analysis', $target_with_fancy_tools, {input => $item})
  }
}

However, this approach is extremely inefficient if the processing time of each item has a large variance. For a hypothetical task that could finish in tens of seconds or tens of minutes, the slice approach would leave a large percentage of processing capacity idle for each chunk that contains an outlier, because every chunk must wait for its slowest item before the next chunk can start. This could add hours to the total plan runtime.
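To put rough numbers on that, here is a small Python sketch that compares the two scheduling strategies. It is only a back-of-the-envelope simulation, not Bolt code: the function names and the durations are made up for illustration, and it ignores connection and scheduling overhead.

```python
import heapq

def chunked_makespan(durations, n):
    """Total time when items run in chunks of n and each chunk waits for its slowest item."""
    return sum(max(durations[i:i + n]) for i in range(0, len(durations), n))

def pooled_makespan(durations, n):
    """Total time when at most n items run at once and the next item starts as soon as a slot frees up."""
    slots = [0] * min(n, len(durations))  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for d in durations:
        start = heapq.heappop(slots)      # earliest free slot
        end = start + d
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish

# Hypothetical workload: each chunk of 10 has one 20-minute outlier and nine 30-second items.
durations = ([1200] + [30] * 9) * 3

print(chunked_makespan(durations, 10))  # 3600 seconds
print(pooled_makespan(durations, 10))   # 1260 seconds
```

With this invented workload the chunked run takes 3600 seconds because the three outliers are effectively serialized, while the bounded pool finishes in 1260 seconds because the outliers overlap with each other and with the short items.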

Describe the Solution You Would Like

A way to limit the concurrency of a parallelize block such that it processes items as quickly as possible, but no more than n at a time. This could be an optional parameter to the parallelize() function.
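For comparison, this is the behavior being asked for, sketched with Python's standard `concurrent.futures` module. It is an analogy only: `run_bounded`, `worker`, and `limit` are hypothetical names, and nothing here reflects how Bolt implements parallelize() or what an eventual parameter would be called.

```python
from concurrent.futures import ThreadPoolExecutor

def run_bounded(items, worker, limit):
    """Apply worker to every item with at most `limit` calls in flight at once.

    A new item starts as soon as any running one finishes, and results
    come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(worker, items))
```

Calling `run_bounded(large_list, analyze, 10)` would keep up to 10 workers busy until the list is exhausted, rather than stalling on the slowest item of each chunk.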

Sharpie, Jun 15 '21 23:06

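Another aspect to this is that sshd throttles unauthenticated SSH connections by default. The MaxStartups setting defaults to 10:30:100, so sshd starts refusing new connections once 10 are still pending authentication: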

https://man7.org/linux/man-pages/man5/sshd_config.5.html

So, even if the task is fast or a semaphore is pushed down into the task, the SSH daemon can refuse connections, and tasks will fail, if parallelize() tries to start everything at once.

Sharpie, Jun 16 '21 19:06

This issue has not had activity for 60 days and will be marked as stale. If this issue continues to have no activity for 7 days, it will be closed.

github-actions[bot], Jul 29 '22 01:07

This is still important. As it stands, parallelize() offers concurrency but not efficiency, because the runtime of a group of tasks is pegged to its slowest execution.

Sharpie, Jul 29 '22 18:07

This issue has not had activity for 60 days and will be marked as stale. If this issue continues to have no activity for 7 days, it will be closed.

github-actions[bot], Sep 28 '22 00:09

This issue is stale and has been closed. If you believe this is in error, or would like the Bolt team to reconsider it, please reopen the issue.

github-actions[bot], Oct 05 '22 00:10

meh, why was this closed? I think this still isn't resolved? Could this be reopened please?

bastelfreak, Mar 02 '23 22:03

The stale issue bot closed it (several times). I have disabled that bot, so this will remain open until we triage it. Thanks.

donoghuc, Mar 02 '23 22:03

Could this issue please be re-opened?

fraenki, Nov 06 '23 10:11