axon
Issues for using MPI in GUI interactive mode
MPI depends entirely on each proc executing the same sequence of AllGather etc. calls at the same times. If any node doesn't, everything just waits and then probably times out with an error. When running a fixed -nogui run, there is no problem here.
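The failure mode can be illustrated without MPI at all. The sketch below is only an analogy: a `threading.Barrier` stands in for an AllGather-style collective, threads stand in for procs, and the function name `run_collective` and the timeout value are made up for illustration.

```python
import threading


def run_collective(n_procs, n_participating, timeout=0.2):
    """Simulate a collective that expects n_procs callers.

    Only n_participating ranks actually make the call. Returns a dict
    mapping each calling rank to "ok" or "timeout".
    """
    # Stand-in for a collective call: it completes only once all
    # n_procs parties have arrived.
    barrier = threading.Barrier(n_procs)
    results = {}

    def proc(rank):
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # One rank never showed up, so every waiting rank fails.
            results[rank] = "timeout"

    threads = [
        threading.Thread(target=proc, args=(rank,))
        for rank in range(n_participating)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_collective(3, 3))  # all ranks call: everyone proceeds
    print(run_collective(3, 2))  # one rank missing: the others time out
```

With all three ranks calling, every rank proceeds; with one rank missing, the two that did call wait and then fail, which mirrors what happens when one node skips a collective.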
But when running interactively, each node needs to get the user's commands to start, stop, step, Init, etc., so they can all stay sync'd. Thus, we need an additional outer loop of communication where the proc > 0 nodes wait for commands and then run them, all the while checking to see if a stop command has come in.
Probably this should be done using something other than MPI, because it needs to be non-blocking and more dynamic. Someone with appropriate network communication knowledge should probably take this on.
I wouldn't use a different protocol, mostly because if we just use MPI for everything we only have to do the MPI_World setup once. With a different protocol it'll get complicated once we have cross-machine MPI with ssh setups etc.
Can't we just put an MPI.BCast from the root node (where the GUI runs) to all other procs into the GUI loop, that tells the other procs about current user input (start, stop etc.)? It should really be blocking, else you'll run into the same issues with timed-out AllReduces. Using blocking will add ~10μs of latency, which will be fast enough to not be noticeable.
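To make the proposal concrete, here is a rough simulation of that command loop. It does not use MPI: threads stand in for procs, a per-worker `queue.Queue` stands in for the blocking broadcast from the root, and a `threading.Barrier` stands in for the collective calls inside a step. The function name `simulate` and the `"quit"` command used to end the simulation are assumptions for illustration only.

```python
import queue
import threading


def simulate(commands, n_procs=3):
    """Broadcast each command from rank 0 to all workers, then run it everywhere.

    Returns one log per rank listing the commands it executed, in order.
    """
    # One inbox per worker (ranks 1..n_procs-1); putting the same command
    # in every inbox plays the role of the broadcast from the root.
    inboxes = [queue.Queue() for _ in range(n_procs - 1)]
    # Stand-in for the AllGather/AllReduce calls made inside a step.
    step_sync = threading.Barrier(n_procs)
    logs = [[] for _ in range(n_procs)]

    def handle(rank, cmd):
        logs[rank].append(cmd)
        if cmd == "step":
            # Every rank must reach this point for the step to complete.
            step_sync.wait(timeout=5)

    def worker(rank):
        while True:
            cmd = inboxes[rank - 1].get()  # blocks until the root sends
            handle(rank, cmd)
            if cmd == "quit":  # hypothetical shutdown command
                return

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(1, n_procs)
    ]
    for t in threads:
        t.start()

    # Root side: what the GUI loop would do for each user command.
    for cmd in list(commands) + ["quit"]:
        for box in inboxes:
            box.put(cmd)
        handle(0, cmd)

    for t in threads:
        t.join()
    return logs


if __name__ == "__main__":
    for rank, log in enumerate(simulate(["Init", "step", "step", "stop"])):
        print(rank, log)
```

Because every worker blocks on the receive before doing anything else, all ranks see the same commands in the same order and therefore enter the collectives inside each step together.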