Coalesce command serialization into single MPI op per task and node
In addition to execution commands, scheduling a task will generate push / await-push / reduction cascades on the participating nodes. So far, each of these implicit commands has been `MPI_Send`-ed to its worker individually, initiating O(N²) MPI operations in the case of an all-to-all communication pattern. As we have investigated previously, this will overwhelm the management of asynchronous operations, at least in OpenMPI, and degrade network performance.
Frame Vectors
This PR introduces `frame_vector`, a contiguous data structure for serializing multiple `command_frame`s as they exist today. The memory layout is as follows:

```
[size_t num_frames][size_t frame_0_size_bytes][size_t frame_k_size_bytes...][frame_0][frame_k...]
```
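For illustration, here is a minimal sketch of how a receiver could walk this layout to locate the individual frames; `locate_frames` and the raw-byte view are invented for this example and are not part of the actual API:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Walk the header to find each frame's payload within the received buffer.
std::vector<const std::byte*> locate_frames(const std::byte* buffer) {
    size_t num_frames;
    std::memcpy(&num_frames, buffer, sizeof(size_t));

    std::vector<const std::byte*> frames;
    frames.reserve(num_frames);

    // Payloads start immediately after the num_frames field and the size table.
    const std::byte* payload = buffer + (1 + num_frames) * sizeof(size_t);
    for (size_t i = 0; i < num_frames; ++i) {
        size_t frame_size;
        std::memcpy(&frame_size, buffer + (1 + i) * sizeof(size_t), sizeof(size_t));
        frames.push_back(payload);
        payload += frame_size;
    }
    return frames;
}
```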
A `frame_vector` can be allocated in two different manners:

- By constructing it `from_size_bytes` and filling it with a byte-wise copy (aka, `MPI_Recv`)
- By assembling it from scratch:
  - Create a `frame_vector_layout`, calling `reserve_back` to compute the required allocation size
  - Create a `frame_vector_builder` from the `layout` and construct frames in-place by calling `emplace_back`
  - Obtain the final `frame_vector` from the `builder` by calling `into_vector`
This three-step process appears somewhat convoluted, but it is easier to implement than an auto-resizing vector and also allows us to build the entire structure with a single call to `malloc`.
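To make the steps concrete, here is a self-contained sketch of the same pattern under assumed semantics; the real builder placement-constructs typed frames rather than copying raw bytes, and `build_frame_vector` is invented for this example:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Build the complete structure into one allocation, mirroring the three steps.
std::unique_ptr<std::byte[]> build_frame_vector(const std::vector<std::vector<std::byte>>& frames) {
    // Step 1 (cf. frame_vector_layout::reserve_back): accumulate the total size.
    size_t total_bytes = sizeof(size_t) * (1 + frames.size()); // num_frames + size table
    for (const auto& frame : frames) total_bytes += frame.size();

    // Step 2 (cf. frame_vector_builder): a single allocation for everything.
    auto buffer = std::make_unique<std::byte[]>(total_bytes);
    const size_t num_frames = frames.size();
    std::memcpy(buffer.get(), &num_frames, sizeof(size_t));

    // Step 3 (cf. emplace_back / into_vector): fill size table and payloads in-place.
    std::byte* payload = buffer.get() + sizeof(size_t) * (1 + num_frames);
    for (size_t i = 0; i < num_frames; ++i) {
        const size_t frame_size = frames[i].size();
        std::memcpy(buffer.get() + sizeof(size_t) * (1 + i), &frame_size, sizeof(size_t));
        if (frame_size > 0) std::memcpy(payload, frames[i].data(), frame_size);
        payload += frame_size;
    }
    return buffer;
}
```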
Shared Frame Pointers
Since we pack commands together based only on their task id and destination node id, while they are otherwise unrelated from the perspective of the Executor, we need to answer the question of how the frame memory should be freed.
The most elegant solution I have come up with so far is to introduce a `shared_frame_ptr` akin to `unique_frame_ptr`, and to create N `shared_frame_ptr`s in the executor that collectively own the allocation of a received `frame_vector` containing N commands.
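One way to express this collective ownership is the aliasing constructor of `std::shared_ptr`; the following sketch uses simplified stand-in types rather than the actual celerity classes:

```cpp
#include <memory>
#include <vector>

struct command_frame { /* serialized command data */ };

struct frame_vector {
    std::vector<command_frame> frames; // the real type is one contiguous allocation
};

using shared_frame_ptr = std::shared_ptr<const command_frame>;

// Each returned pointer addresses one frame, but keeps the entire frame_vector
// allocation alive until the last frame has been consumed.
std::vector<shared_frame_ptr> split_into_shared_frames(std::shared_ptr<frame_vector> vec) {
    std::vector<shared_frame_ptr> result;
    result.reserve(vec->frames.size());
    for (const command_frame& frame : vec->frames) {
        result.emplace_back(vec, &frame); // aliasing constructor: shares ownership of *vec
    }
    return result;
}
```

With this scheme, the reference count lives with the `frame_vector` allocation, so the memory would be freed exactly once, after the executor has consumed the last of the N commands.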
Goodbye, Virtual Dependencies
The `graph_serializer` has been almost entirely rewritten to move from a flush-as-you-go model to a collect-and-flush-once approach. This has the added benefit of requiring only around 2N allocations for N nodes.
So far, we had the concept of a virtual dependency, aka an in-CDAG dependency on a command that exists only implicitly and has not actually been flushed to its worker. This only applies to the initial epoch, which is created implicitly in the TDAG and CDAG, and parts of the code (especially tests) weren't able to deal with a dependency on a command they did not receive. This special-casing of virtual dependencies became a nuisance when trying to compute command payload sizes in advance.
As a solution, this PR moves to serializing dependencies on init-epoch commands like any other dependency, and equips the (test) code in question to deal with those implicit commands. Technically this means we are inflating each affected command frame by one `size_t`, but I would argue that the performance impact of this will not be measurable.
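To see where that one `size_t` comes from, assume dependencies are serialized as a count-prefixed list of command ids inside the frame; this is a hypothetical layout for illustration, not the actual `command_frame` definition:

```cpp
#include <cstddef>
#include <vector>

using command_id = size_t;

// Hypothetical payload: dependencies as a count-prefixed list of command ids.
struct serialized_command {
    command_id cid;
    std::vector<command_id> dependencies; // may now include the init-epoch command
};

// Each dependency, including the formerly virtual init-epoch one, contributes
// exactly one size_t-sized entry to the frame's payload size.
size_t payload_size_bytes(const serialized_command& cmd) {
    return sizeof(command_id)                            // the command id itself
         + sizeof(size_t)                                // dependency count
         + cmd.dependencies.size() * sizeof(command_id); // one entry per dependency
}
```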
What is Left to Figure Out
- [ ] This PR adds even more frame-management types than we already have. We could simplify this, e.g. by
  - getting rid of `unique_frame_ptr` and only ever transmitting `frame_vector`s (which would then have size 1 for `data_frame`s and waste one `size_t` of bandwidth)
  - making shared-pointer semantics the default and getting rid of `unique_*_ptr` types
- [ ] Benchmark this PR against master to figure out if we have addressed the OpenMPI bottleneck
- [ ] Bikeshed terminology: does `frame_vector` make sense, or is the vector the actual frame (because it's what is being transmitted over the network) and its elements should have a different name?
- [x] ~~This PR seems to introduce a bug around reductions that is sometimes observable on DPC++ (see the most recent CI failure). See if this is caused by the modified transitive-dependency handling logic in `graph_serializer`, or if we depend on the order in which commands are sent in an incorrect manner.~~ Was unrelated to this PR after all, and is fixed by #146.
clang-tidy review says "All clean, LGTM! :+1:"
This will be made obsolete by distributed command generation in the future, which obviates the need for sending out commands.