hnn-core
[WIP] ENH: add ability to spawn MPI jobs from parent job
The goal here is to allow `MPIBackend` to spawn child jobs from an existing MPI job/communicator. A user could then run any simulation script (e.g., `plot_simulate_mpi_backend.py`) from a computing cluster with the following command, and it will spawn as many sub-processes as are specified by `MPIBackend(n_procs=...)` to complete the simulation job:
```
$ mpiexec -np 1 --oversubscribe python -m mpi4py /path/to/script.py
```
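For context, here is a minimal mpi4py sketch (not this PR's actual code) of the dynamic-process-management pattern this enables; the child script name and payload dict below are hypothetical placeholders:

```python
# parent script, launched as: mpiexec -np 1 python -m mpi4py parent.py
# Minimal sketch of spawning child workers from an existing MPI job.
from mpi4py import MPI

n_procs = 4
# Spawn returns an intercommunicator connecting the parent to the children.
child_comm = MPI.COMM_SELF.Spawn(
    'python', args=['-m', 'mpi4py', 'child.py'], maxprocs=n_procs)

payload = {'tstop': 170.0, 'n_trials': 1}         # placeholder parameters
child_comm.bcast(payload, root=MPI.ROOT)          # send work to all children
results = child_comm.gather(None, root=MPI.ROOT)  # one result per child
child_comm.Disconnect()
```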
I'm currently interested in this use-case for two reasons:
- It will allow a user to implement more complex yet faster MPI configurations on a computing cluster, e.g., parallelizing across both simulations and neurons, whereas currently only one or the other is possible.
- It solves (I think) some of the security concerns with allowing a process to instantiate its own MPI parent job, which is what `MPIBackend` currently does.
closes #477
For the time being, I've created a new file called `mpi_comm_spawn_child.py` that more-or-less mirrors the style of `mpi_child.py`. Eventually, these files should be combined into one. It also currently runs, so give it a try!
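For reference, a spawned child script along these lines would look roughly like the following (a sketch only; `_run_sim` is a hypothetical stand-in for the per-rank simulation call):

```python
# Rough sketch of what a spawned child process does: connect back to the
# parent via the intercommunicator, receive work, return results.
from mpi4py import MPI

parent = MPI.Comm.Get_parent()       # intercomm to the spawning process
params = parent.bcast(None, root=0)  # receive payload from parent rank 0
result = _run_sim(params)            # hypothetical per-rank simulation call
parent.gather(result, root=0)        # send this rank's result back
parent.Disconnect()
```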
Codecov Report
Merging #506 (c616558) into master (c7460ba) will decrease coverage by 0.87%. The diff coverage is 34.78%.
:exclamation: Current head c616558 differs from pull request most recent head 99c7baf. Consider uploading reports for the commit 99c7baf to get more accurate results
```diff
@@           Coverage Diff            @@
##           master    #506     +/-  ##
==========================================
- Coverage   87.63%   86.75%   -0.88%
==========================================
  Files          19       20       +1
  Lines        3792     3843      +51
==========================================
+ Hits         3323     3334      +11
- Misses        469      509      +40
==========================================
```
Impacted Files | Coverage Δ |
---|---|
hnn_core/mpi_child.py | 96.47% <ø> (ø) |
hnn_core/mpi_comm_spawn_child.py | 0.00% <0.00%> (ø) |
hnn_core/parallel_backends.py | 79.68% <60.00%> (-1.81%) :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I'm interested! You'll need to educate me a little over a video call on what the motivation is and what the pros/cons of spawning jobs through MPI vs. Python are, etc.
Let's set up a time to chat more about this!
I'd definitely like to be a part of the meeting as well. I'm wondering if this is also the reason why MPI doesn't work in WSL?
Propose a time! I'm okay with Wed/Thu afternoon and anytime Friday.
@rythorpe I added the comment that it closes #477
FYI, I think the reason `MPIBackend` is set up the way it is currently (i.e., prior to this PR) is specifically to provide timeouts + tracebacks when errors occur prior to a blocking MPI call. (See https://github.com/mpi4py/mpi4py/issues/53 for a discussion on this topic.) An `MPIBackend` subprocess created via `subprocess.Popen` enforces a timeout when waiting for communications to/from child MPI processes and terminates them when the timeout monitored by the parent process is exceeded, capturing `stderr` and `stdout` on the way out.
Correspondingly, unit tests that explicitly test the guts of `MPIBackend` should use the original codepath (i.e., calling `mpiexec` internally when `mpi_comm_spawn=False`) so that tracebacks can be preserved and logged. Tests for the new functionality introduced here will also be written; however, they should not replace the previous tests.
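Something along these lines, perhaps (a hypothetical pytest sketch; the `mpi_comm_spawn` keyword is taken from this discussion and the final API may differ):

```python
# Hypothetical test sketch exercising both codepaths via parametrization.
import pytest
from hnn_core import jones_2009_model, simulate_dipole
from hnn_core.parallel_backends import MPIBackend

@pytest.mark.parametrize('mpi_comm_spawn', [False, True])
def test_mpi_backend_codepaths(mpi_comm_spawn):
    net = jones_2009_model()
    # mpi_comm_spawn=False keeps the original mpiexec-via-Popen codepath
    # so timeouts and tracebacks are still exercised.
    with MPIBackend(n_procs=2, mpi_comm_spawn=mpi_comm_spawn):
        dpls = simulate_dipole(net, tstop=40., n_trials=1)
    assert len(dpls) == 1
```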
Seems like this might be a way to go: https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html using an `-m` flag? Try to avoid peppering the codebase with `try`/`except` blocks ... it's really an anti-pattern. But if that's the only way to get MPI to work gracefully, you can put it in one function ...
> Seems like this might be a way to go: https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html using an `-m` flag? Try to avoid peppering the codebase with `try`/`except` blocks ... it's really an anti-pattern. But if that's the only way to get MPI to work gracefully, you can put it in one function ...
Yes, this is what I've been doing when calling `mpiexec`; however, I haven't really explored using `try`/`except` statements. In order for this to be a robust and dependable solution for handling errors, we'd have to put `try`/`except` statements around every blocking MPI call in the codebase. This might work, though, since there aren't many MPI calls... I'll need to look into this more.
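To make the failure mode concrete, the pattern under discussion is roughly this (a sketch; `_run_chunk` is a hypothetical stand-in for the per-rank work):

```python
# If one rank raises before reaching a blocking collective, the remaining
# ranks hang waiting on it. Catching the exception and calling Abort tears
# down the whole job instead; `python -m mpi4py` installs a sys.excepthook
# that does essentially this, which is why the -m flag helps.
from mpi4py import MPI
import traceback

comm = MPI.COMM_WORLD
try:
    result = _run_chunk(comm.rank)         # hypothetical per-rank work
    results = comm.gather(result, root=0)  # blocking collective
except Exception:
    traceback.print_exc()
    comm.Abort(1)  # abort every rank so none is left blocking forever
```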
Minor update: I finally got this branch to run off of a master MPI process on Oscar (Brown's HPC). See here for a demo.