Dagger.jl
Add DaggerMPI subpackage for MPI integrations
Building on Dagger's unified hashing framework, DaggerMPI.jl allows DAGs to execute efficiently on an MPI cluster. Per-task hashes are used to "color" the DAG, disabling execution of each task on all but one MPI worker. Data movement is typically peer-to-peer via MPI `Send`/`Recv`, coordinated by tags computed from the same coloring scheme. This scheme allows Dagger's scheduler to remain unmodified and unaware of the MPI cluster's existence, while still providing "exactly once" execution semantics for each task in the DAG.
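To illustrate the idea outside of Dagger, here is a minimal sketch assuming MPI.jl's object-passing `MPI.send`/`MPI.recv` API; `owner` and `mpi_tag` are hypothetical helpers standing in for whatever DaggerMPI derives internally from the per-task hash:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Hypothetical helpers: every rank computes the same hash for the same task,
# so all ranks agree on the owning rank and the message tag with no
# coordination at all.
owner(task_hash::UInt) = Int(task_hash % UInt(nranks))
mpi_tag(task_hash::UInt) = Int(task_hash % UInt(32768))  # MPI_TAG_UB >= 32767

h = hash((:my_task, 42))  # stand-in for Dagger's per-task hash

if rank == owner(h)
    result = 1 + 1  # the task body runs only on its owning rank
    # Ship the result to a rank that needs it (rank 0 here, for illustration)
    rank != 0 && MPI.send(result, comm; dest=0, tag=mpi_tag(h))
elseif rank == 0
    result = MPI.recv(comm; source=owner(h), tag=mpi_tag(h))
end
```

Because every rank derives the same owner and tag from the same hash, no negotiation round is needed, which is what lets the scheduler stay oblivious to MPI.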
This PR is mostly good to go; however, there is one aspect I'm not happy about: all nodes must spawn the same tasks in the same order, or else we get a hang. This is currently necessary because we post "blind" sends and receives, assuming that the matching counterpart will eventually be posted.
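As a stripped-down illustration of the failure mode (plain MPI.jl, not DaggerMPI): if one rank's control flow diverges from the others, it can post a receive whose matching send is never posted anywhere:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Rank 0 takes a branch the other ranks never see, so it posts a receive
# for which rank 1 never posts the matching send: rank 0 blocks forever.
if rank == 0
    data = MPI.recv(comm; source=1, tag=99)  # hangs
end

MPI.Barrier(comm)  # never reached on rank 0, so every rank ends up stuck
```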
A partial fix might involve asynchronously exchanging task hashes between nodes: when we find a task hash that hasn't been registered on a given node, we "vote" to assign it to a node which does have it, and ensure all nodes are aware of that decision for downstream tasks which depend on that task's result. This vote can be initiated from any node whose tasks are stalled waiting on data (send or receive); it's essentially a more active way of guaranteeing that some node will eventually post or consume the result, so the data becomes available.
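A rough sketch of what that exchange could look like, using MPI.jl's object `MPI.bcast` for brevity (a real implementation would presumably be asynchronous and incremental rather than a synchronous round like this):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Hashes of the tasks this rank has actually spawned (stand-in data).
local_hashes = Set{UInt}(hash.([(:a, rank), (:common, 0)]))

# Every rank broadcasts its hash set in turn (MPI.bcast handles arbitrary
# serializable objects), so all ranks learn who knows about which task.
known = Dict{UInt,Vector{Int}}()
for r in 0:nranks-1
    hs = MPI.bcast(r == rank ? local_hashes : nothing, comm; root=r)
    for h in hs
        push!(get!(known, h, Int[]), r)
    end
end

# Deterministic "vote": assign each task to the lowest rank that has it.
# Every rank computes the same assignment from the same information, so no
# further agreement round is needed for downstream tasks.
assignment = Dict(h => minimum(rs) for (h, rs) in known)
```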
That fix is really only a workaround; a more complete solution might provide a way to inform the cluster that there is conditional logic around a task spawn point, and use that as a cue to initiate an early vote, or let the user explicitly select which node(s) to consider.
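Purely as a strawman for that last option, the user-facing side might look something like the following; `with_conditional_spawns` and the names inside it are entirely hypothetical (only `Dagger.@spawn` exists today):

```julia
# Hypothetical API sketch: bracketing rank-dependent spawn logic would tell
# the runtime to either initiate an early vote or pin ownership of anything
# spawned inside to the given ranks.
Dagger.with_conditional_spawns(; owners=[0]) do
    if rank_local_condition()               # placeholder predicate
        Dagger.@spawn expensive_step(data)  # placeholder task
    end
end
```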