
find_best_packing is embarrassingly parallel

Open geneotech opened this issue 5 months ago • 7 comments

One could at the very least run these sorting passes concurrently using std::async:

		auto make_order = [&f, &orders_ref](auto& predicate) {
			std::sort(orders_ref[f].begin(), orders_ref[f].end(), predicate);
			++f;
		};

		make_order(comparator);
		(make_order(comparators), ...);

But they could instead be combined with a direct call to best_packing_for_ordering, rather than going through find_best_packing_impl. Each worker would of course need to declare its own:

		empty_spaces_type root = rect_wh();
		root.flipping_mode = input.flipping_mode;

And then, after all workers complete, re-insert the best order and handle the successful/unsuccessful insertions. However, it would be good to guard this behind #ifndef RECTPACK2D_SINGLETHREADED, just in case.

geneotech avatar Oct 11 '25 18:10 geneotech

In case RECTPACK2D_SINGLETHREADED is defined, one may further optimize the memory allocations:

Consider that it is unnecessary to allocate as many as count_orders * std::size(subjects) pointers. In fact, 2 * std::size(subjects) pointers suffice.

Now given n = count_valid_subjects:

  • The interval [0, n) is where the current_order is held.
  • The interval [n, 2n) is where the best_order is held.
  • Every time a new order is attempted, [0, n) is re-sorted with the next comparator without another allocation.
  • If current_order beats best_order, std::copy from [0, n) to [n, 2n).
    • Note that even if it beats the best_order every time, that is still at most count_orders full linear copies, which is what the current algorithm does anyway at the preparatory stage.

geneotech avatar Oct 12 '25 13:10 geneotech

I'm thinking out loud here, but since the algorithm is so fast, spinning up threads for every call would be wasteful. What comes to mind instead is a new function, find_best_packing_mt(), which would accept an extra templated thread_pool& tp parameter. This would be a user-provided class with a std::future<order_type> add_task(F&& func) member function (just like Subjects& subjects can be any type as long as it's iterable and its elements implement get_rect()), which the algorithm would call to compute the orders asynchronously.

This should allow the library consumer to use their own thread pool if they're already using one in their application, as well as giving them the "freedom" to choose whether they want to parallelize the algorithm in the first place.

lyorig avatar Oct 14 '25 11:10 lyorig

Yes, a separate function instead of fooling around with ifdefs totally makes sense. Admittedly, I'm not even aware of the common way in C++ to pass in a thread pool reference, the way you can pass a memory allocator to leave its management to the caller.

geneotech avatar Oct 14 '25 13:10 geneotech

From my experience, the standard in C++ is that there are twenty different coding standards adopted by twenty different large software companies, so IMO what matters here is documenting whatever choice is made properly. I'm open to writing a mockup via a PR if you'd like 🫡

lyorig avatar Oct 14 '25 13:10 lyorig

Go ahead - if you wish to work on this, it'd be best to:

  • First move around the existing single-threaded code as necessary to make it easily callable from the workers later.
  • Ensure the existing CI/CD test passes.
  • Depending on how complex that turns out, we might want to first merge modifications up to this point.
  • Only then introduce the multithreaded function (call it find_best_packing_mt_experimental).
    • Add a default pool that just calls std::async for every task, but ensure it has to be passed explicitly.
  • Add it to examples as the last one (using the default pool) and extend the expected output in tests/ accordingly (the output should be identical to the first example's).

The memory allocation optimization for the single-threaded version should be done in a separate PR after all this is done.
As usual, I'll be able to properly cooperate on the weekend.

geneotech avatar Oct 14 '25 14:10 geneotech

Hi, I'm unfortunately going to be late working on this - studies and all that. But from the little thinking I've done, the important point is that this multi-threaded approach would first need to fire off all best_packing_for_ordering() calls, then loop over the returned futures with future::wait(). However, find_best_packing_impl() batches this together: every packing is calculated and then immediately compared to the best one. So I'm wondering how this could be split to accommodate the MT version. The only thing I'm unsure of is a potential performance difference for the existing single-threaded functions, which would have to adopt this new "calculate-everything-then-compare-everything" approach, since an additional iteration would be involved.

lyorig avatar Oct 22 '25 08:10 lyorig

Yes, I've been thinking about this... perhaps let's scrap the idea of multithreading for now, as I've already heard a few voices saying the library is too bloated. I realized that if push comes to shove, the library user can always call several find_best_packing_dont_sort invocations in parallel and then pick the best result, without even having to reinsert. At most, we should make the public interface reasonable enough that it's easy to parallelize on one's own. However, feel free to implement the 2n allocation optimization for the single-threaded version. Don't worry, I'm also busy lately, so none of this is a priority.

geneotech avatar Oct 27 '25 21:10 geneotech