`find_best_packing` is embarrassingly parallel.
One could at the very least perform these sorts using `std::async`:
```cpp
auto make_order = [&f, &orders_ref](auto& predicate) {
    std::sort(orders_ref[f].begin(), orders_ref[f].end(), predicate);
    ++f;
};

make_order(comparator);
(make_order(comparators), ...);
```
But they could be combined with a direct call to `best_packing_for_ordering` instead of calling `find_best_packing_impl`. The workers would, of course, each need to declare their own:
```cpp
empty_spaces_type root = rect_wh();
root.flipping_mode = input.flipping_mode;
```
And then after all workers complete, re-insert the best order and handle successful/unsuccessful insertions.
However, it would be good to only do this under `#ifndef RECTPACK2D_SINGLETHREADED`, just in case.
If `RECTPACK2D_SINGLETHREADED` is defined, the memory allocations can be further optimized:
Consider that it is unnecessary to allocate as much as `count_orders * std::size(subjects)`; it is in fact sufficient to have `2 * std::size(subjects)` pointers allocated.

Now given `n = count_valid_subjects`:

- The interval `[0, n)` is where the `current_order` is held.
- The interval `[n, 2n)` is where the `best_order` is held.
- Every time a new order is attempted, `[0, n)` is re-sorted with the next comparator without another allocation.
- If `current_order` beats `best_order`, `std::copy` from `[0, n)` to `[n, 2n)`.
  - Note that even if it beats the `best_order` every time, that is still at most `count_orders` full linear copies, which is what the current algorithm does anyway at the preparatory stage.
I'm thinking out loud here, but since the algorithm is so fast, spinning up threads for every call would be wasteful. What comes to mind instead is a new function, `find_best_packing_mt()`, which would accept an extra templated `thread_pool& tp` parameter. This would be a user-provided class with a `std::future<order_type> add_task(F&& func)` member function (just like `Subjects& subjects` can be whatever as long as it's iterable and its members implement `get_rect()`), which the algorithm would call to start asynchronously computing orders.
This should allow the library consumer to use their own thread pool if they're already using one in their application, as well as giving them the "freedom" to choose whether they want to parallelize the algorithm in the first place.
Yes, a separate function instead of fooling around with ifdefs totally makes sense. Admittedly, I'm not even aware of the common way in C++ to pass a thread pool reference, the way you can pass a memory allocator to make the management controllable by the caller.
From my experience, the standard in C++ is that there are twenty different coding standards adopted by twenty different large software companies, so IMO what matters here is documenting whatever choice is made properly. I'm open to writing a mockup via a PR if you'd like 🫡
Go ahead - if you wish to work on this, it'd be best to:
- First move around the existing single-threaded code as necessary to make it easily callable from the workers later.
- Ensure the existing CI/CD test passes.
- Depending on how complex that turns out, we might want to first merge modifications up to this point.
- Only then introduce the multithreaded function (call it `find_best_packing_mt_experimental`).
- Add a default pool that just calls `std::async` for every task, but ensure it has to be passed explicitly.
- Add it to the examples as the last one (using the default pool) and extend its expected output in `tests/` (the output should be identical to that of the first example).
The memory allocation optimization for the single-threaded version should be done in a separate PR after all this is done.
As usual, I'll be able to properly cooperate on the weekend.
Hi, I'm unfortunately going to be late working on this; studies and all that. But from the little thinking I've done, the important thing is that this multi-threaded approach would first need to fire off all `best_packing_for_ordering()` calls, then loop over all the returned futures with `future::wait()`. However, `find_best_packing_impl()` batches this together: every packing is calculated and then immediately compared to the best one. So I'm wondering whether this could be split somehow to accommodate the MT version. The only thing I'm unsure of is a potential performance difference for the existing single-threaded functions, which would have to adopt this new "calculate-everything-then-compare-everything" approach, since an additional iteration would be involved.
Yes, I've been thinking about this... perhaps let's scratch the idea of multithreading for now, as I've already heard a few voices saying that the library is too bloated. I realized that if push comes to shove, the library user can always just call several `find_best_packing_dont_sort` invocations in parallel and then find the best one, without even having to re-insert. At most, we should make the public interface reasonable enough that it's easy to parallelize on one's own. However, feel free to implement the 2n allocation optimization for the single-threaded version. Don't worry, I'm also busy lately, so none of this is a priority.