Coalesce command serialization into single MPI op per task and node
In addition to execution commands, scheduling a task will generate push / await-push / reduction cascades on the participating nodes. So far, each of these implicit commands has been `MPI_Send`-ed to its worker individually, initiating O(N²) MPI operations in the case of an all-to-all communication pattern. As we have investigated previously, this will overwhelm the management of asynchronous operations, at least in OpenMPI, and degrade network performance.
Frame Vectors
This PR introduces `frame_vector`, a contiguous data structure for serializing multiple `command_frame`s as they exist today. The memory layout is as follows:

```
[size_t num_frames][size_t frame_0_size_bytes][size_t frame_k_size_bytes...][frame_0][frame_k...]
```
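For illustration, here is a minimal sketch of how a receiver could walk this layout to locate the individual frames; `locate_frames` and the raw-byte view are invented for this example and are not part of the actual API:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Walk the header to find each frame's payload within the received buffer.
std::vector<const std::byte*> locate_frames(const std::byte* buffer) {
    size_t num_frames;
    std::memcpy(&num_frames, buffer, sizeof(size_t));

    std::vector<const std::byte*> frames;
    frames.reserve(num_frames);

    // Payloads start immediately after the num_frames field and the size table.
    const std::byte* payload = buffer + (1 + num_frames) * sizeof(size_t);
    for (size_t i = 0; i < num_frames; ++i) {
        size_t frame_size;
        std::memcpy(&frame_size, buffer + (1 + i) * sizeof(size_t), sizeof(size_t));
        frames.push_back(payload);
        payload += frame_size;
    }
    return frames;
}
```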
A `frame_vector` can be allocated in two different manners:

- By constructing it `from_size_bytes` and filling it with a byte-wise copy (aka, `MPI_Recv`)
- By assembling it from scratch:
  - Create a `frame_vector_layout`, calling `reserve_back` to compute the required allocation size
  - Create a `frame_vector_builder` from the `layout` and construct frames in-place by calling `emplace_back`
  - Obtain the final `frame_vector` from the `builder` by calling `into_vector`
This three-step process appears somewhat convoluted, but it is easier to implement than an auto-resizing vector and also allows us to build the entire structure with a single call to `malloc`.
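To make the steps concrete, here is a self-contained sketch of the same pattern under assumed semantics; the real builder placement-constructs typed frames rather than copying raw bytes, and `build_frame_vector` is invented for this example:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Build the complete structure into one allocation, mirroring the three steps.
std::unique_ptr<std::byte[]> build_frame_vector(const std::vector<std::vector<std::byte>>& frames) {
    // Step 1 (cf. frame_vector_layout::reserve_back): accumulate the total size.
    size_t total_bytes = sizeof(size_t) * (1 + frames.size()); // num_frames + size table
    for (const auto& frame : frames) total_bytes += frame.size();

    // Step 2 (cf. frame_vector_builder): a single allocation for everything.
    auto buffer = std::make_unique<std::byte[]>(total_bytes);
    const size_t num_frames = frames.size();
    std::memcpy(buffer.get(), &num_frames, sizeof(size_t));

    // Step 3 (cf. emplace_back / into_vector): fill size table and payloads in-place.
    std::byte* payload = buffer.get() + sizeof(size_t) * (1 + num_frames);
    for (size_t i = 0; i < num_frames; ++i) {
        const size_t frame_size = frames[i].size();
        std::memcpy(buffer.get() + sizeof(size_t) * (1 + i), &frame_size, sizeof(size_t));
        if (frame_size > 0) std::memcpy(payload, frames[i].data(), frame_size);
        payload += frame_size;
    }
    return buffer;
}
```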
Shared Frame Pointers
Since we pack commands together based only on their task id and destination node id, while they are otherwise unrelated from the perspective of the Executor, we need to answer the question of how the frame memory should be freed.
The most elegant solution I have come up with so far is to introduce a `shared_frame_ptr` akin to `unique_frame_ptr`, and to create N `shared_frame_ptr`s in the executor that collectively own the allocation of a received `frame_vector` containing N commands.
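One way to express this collective ownership is the aliasing constructor of `std::shared_ptr`; the following sketch uses simplified stand-in types rather than the actual celerity classes:

```cpp
#include <memory>
#include <vector>

struct command_frame { /* serialized command data */ };

struct frame_vector {
    std::vector<command_frame> frames; // the real type is one contiguous allocation
};

using shared_frame_ptr = std::shared_ptr<const command_frame>;

// Each returned pointer addresses one frame, but keeps the entire frame_vector
// allocation alive until the last frame has been consumed.
std::vector<shared_frame_ptr> split_into_shared_frames(std::shared_ptr<frame_vector> vec) {
    std::vector<shared_frame_ptr> result;
    result.reserve(vec->frames.size());
    for (const command_frame& frame : vec->frames) {
        result.emplace_back(vec, &frame); // aliasing constructor: shares ownership of *vec
    }
    return result;
}
```

With this scheme, the reference count lives with the `frame_vector` allocation, so the memory would be freed exactly once, after the executor has consumed the last of the N commands.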
Goodbye, Virtual Dependencies
The `graph_serializer` has been almost entirely rewritten to move from a flush-as-you-go model to a collect-and-flush-once approach. This has the added benefit of requiring only around 2N allocations for N nodes.
So far, we had the concept of a virtual dependency, aka an in-CDAG dependency on a command that exists only implicitly and has not actually been flushed to its worker. This only applies to the initial epoch, which is created implicitly in the TDAG and CDAG, and parts of the code (especially tests) weren't able to deal with a dependency on a command they did not receive. This special-casing of virtual dependencies became a nuisance when trying to compute command payload sizes in advance.
As a solution, this PR moves to serializing dependencies on init-epoch commands like any other dependency, and equips the (test) code in question to deal with those implicit commands. Technically this means we are inflating each affected command frame by one `size_t`, but I would argue that the performance impact of this will not be measurable.
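To see where that one `size_t` comes from, assume dependencies are serialized as a count-prefixed list of command ids inside the frame; this is a hypothetical layout for illustration, not the actual `command_frame` definition:

```cpp
#include <cstddef>
#include <vector>

using command_id = size_t;

// Hypothetical payload: dependencies as a count-prefixed list of command ids.
struct serialized_command {
    command_id cid;
    std::vector<command_id> dependencies; // may now include the init-epoch command
};

// Each dependency, including the formerly virtual init-epoch one, contributes
// exactly one size_t-sized entry to the frame's payload size.
size_t payload_size_bytes(const serialized_command& cmd) {
    return sizeof(command_id)                            // the command id itself
         + sizeof(size_t)                                // dependency count
         + cmd.dependencies.size() * sizeof(command_id); // one entry per dependency
}
```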
What is Left to Figure Out
- [ ] This PR adds even more frame-management types than we already have. We could simplify this, e.g. by
  - getting rid of `unique_frame_ptr` and only ever transmitting `frame_vector`s (which would then have size 1 for `data_frame`s and waste one `size_t` of bandwidth)
  - making shared-pointer semantics the default and getting rid of `unique_*_ptr` types
- [ ] Benchmark this PR against master to figure out if we have addressed the OpenMPI bottleneck
- [ ] Bikeshed terminology: does `frame_vector` make sense, or is the vector the actual frame (because it's what is being transmitted over the network) and its elements should have a different name?
- [x] ~~This PR seems to introduce a bug around reductions that is sometimes observable on DPC++ (see the most recent CI failure). See if this is caused by the modified transitive-dependency handling logic in `graph_serializer`, or if we depend on the order in which commands are sent in an incorrect manner.~~ Was unrelated to this PR after all, and is fixed by #146.
clang-tidy review says "All clean, LGTM! :+1:"
This will be made obsolete by distributed command generation in the future, which obviates the need for sending out commands.