Add "dynamic" parallelization strategies.
This change adds the following parallelization functions:
`pthreadpool_parallelize_1d_dynamic`, `pthreadpool_parallelize_2d_dynamic_1d`, `pthreadpool_parallelize_2d_dynamic`, and `pthreadpool_parallelize_3d_dynamic_2d`.
These functions are similar to the `pthreadpool_parallelize_Xd_tile_Yd` functions, but differ in that the `tile_*` values are not bounds, but instead preferred multiples.
For example, `pthreadpool_parallelize_1d_dynamic(threadpool, task, context, range, tile, flags)` calls the user-provided task function
`task(context, offset, count)`
where
`offset` is in the range `[0, range)` and an integer multiple of `tile`, and `count` is an integer multiple of `tile` unless `offset + count = range`.
The `tile` parameter is understood as a preferred multiple of indices, not as an upper bound.
The counts are chosen so as to minimize the number of calls to `task` while still balancing the workload across all threads.
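For illustration, here is a minimal usage sketch of the new 1D variant. The callback and threadpool setup below are hypothetical, and the argument types are assumed to mirror the existing `pthreadpool_parallelize_1d_tile_1d` API:

```c
#include <stddef.h>
#include <pthreadpool.h>

struct scale_context {
  float* data;
  float factor;
};

/* Task callback: scales `count` consecutive elements starting at `offset`.
 * `offset` is a multiple of the requested tile; `count` is a multiple of the
 * tile except possibly in the last call, where offset + count == range. */
static void scale_task(void* context, size_t offset, size_t count) {
  struct scale_context* ctx = (struct scale_context*) context;
  for (size_t i = offset; i < offset + count; i++) {
    ctx->data[i] *= ctx->factor;
  }
}

void scale_in_parallel(pthreadpool_t threadpool, float* data, size_t range) {
  struct scale_context ctx = {data, 0.5f};
  /* Ask for work in preferred multiples of 64 elements; the pool may hand a
     thread several tiles at once to reduce the number of callback calls. */
  pthreadpool_parallelize_1d_dynamic(
      threadpool, scale_task, &ctx, range, /*tile=*/64, /*flags=*/0);
}
```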
Under the hood, each thread tries to reserve a "chunk" of tiles corresponding to
```c
size_t chunk_size = max(num_tasks / (2 * num_threads), 1);
chunk_size = ((chunk_size + tile - 1) / tile) * tile;
```
i.e. the number of remaining tasks divided by twice the number of threads, rounded up to the next multiple of `tile`. This balances the workload well provided the speed difference between the fastest and slowest threads does not exceed a factor of two.
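As a concrete sketch of this rule (the helper and variable names below are illustrative, not taken from the patch), assuming the chunk size is recomputed from the work that is still unclaimed at each reservation:

```c
#include <stddef.h>

/* Illustrative helper: half of an even per-thread split of the remaining
   indices, at least one index, rounded up to a whole number of tiles. */
static size_t compute_chunk_size(size_t remaining, size_t num_threads, size_t tile) {
  size_t chunk_size = remaining / (2 * num_threads);
  if (chunk_size == 0) {
    chunk_size = 1;
  }
  /* Round up to the next multiple of `tile`. */
  return ((chunk_size + tile - 1) / tile) * tile;
}
```

For example, with `range = 1000`, 8 threads, and `tile = 16`, the first reservation takes `1000 / 16 = 62` indices rounded up to 64, leaving 936; the next reservation again yields 64, and chunks shrink as the remaining work runs down, so faster threads naturally claim more chunks.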
Hi @Maratyszcza,
Please let me know whether you have time to review this PR, or whether you are still accepting PRs at all :relaxed:
Cheers, Pedro