hnn-core
[WIP] ENH: add ability to spawn MPI jobs from parent job
The goal here is to allow `MPIBackend` to spawn child jobs from an existing MPI job/communicator. A user could then run any simulation script (e.g., `plot_simulate_mpi_backend.py`) from a computing cluster with the following command, and it will spawn as many sub-processes as are specified by `MPIBackend(n_procs=...)` to complete the simulation job:
```
$ mpiexec -np 1 --oversubscribe python -m mpi4py /path/to/script.py
```
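For context, here is a minimal mpi4py sketch (not this PR's actual code) of the dynamic-process-management pattern this enables; the child script name and payload dict below are hypothetical placeholders:

```python
# parent script, launched as: mpiexec -np 1 python -m mpi4py parent.py
# Minimal sketch of spawning child workers from an existing MPI job.
from mpi4py import MPI

n_procs = 4
# Spawn returns an intercommunicator connecting the parent to the children.
child_comm = MPI.COMM_SELF.Spawn(
    'python', args=['-m', 'mpi4py', 'child.py'], maxprocs=n_procs)

payload = {'tstop': 170.0, 'n_trials': 1}         # placeholder parameters
child_comm.bcast(payload, root=MPI.ROOT)          # send work to all children
results = child_comm.gather(None, root=MPI.ROOT)  # one result per child
child_comm.Disconnect()
```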
I'm currently interested in this use-case for two reasons:
- It will allow a user to implement more complex yet faster MPI configurations on a computing cluster, e.g., parallelizing across both simulations and neurons, whereas currently only one or the other is possible.
- It solves (I think) some of the security concerns with allowing a process to instantiate its own MPI parent job, which is what `MPIBackend` currently does.
closes #477
For the time being, I've created a new file called `mpi_comm_spawn_child.py` that more-or-less mirrors the style of `mpi_child.py`. Eventually, these files should be combined into one. It also currently runs, so give it a try!
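For reference, a spawned child script along these lines would look roughly like the following (a sketch only; `_run_sim` is a hypothetical stand-in for the per-rank simulation call):

```python
# Rough sketch of what a spawned child process does: connect back to the
# parent via the intercommunicator, receive work, return results.
from mpi4py import MPI

parent = MPI.Comm.Get_parent()       # intercomm to the spawning process
params = parent.bcast(None, root=0)  # receive payload from parent rank 0
result = _run_sim(params)            # hypothetical per-rank simulation call
parent.gather(result, root=0)        # send this rank's result back
parent.Disconnect()
```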
Codecov Report
Merging #506 (c616558) into master (c7460ba) will decrease coverage by 0.87%. The diff coverage is 34.78%.
:exclamation: Current head c616558 differs from pull request most recent head 99c7baf. Consider uploading reports for the commit 99c7baf to get more accurate results
```diff
@@           Coverage Diff            @@
##           master    #506     +/-  ##
==========================================
- Coverage   87.63%   86.75%   -0.88%
==========================================
  Files          19       20       +1
  Lines        3792     3843      +51
==========================================
+ Hits         3323     3334      +11
- Misses        469      509      +40
==========================================
```
Impacted Files | Coverage Δ |
---|---|
hnn_core/mpi_child.py | 96.47% <ø> (ø) |
hnn_core/mpi_comm_spawn_child.py | 0.00% <0.00%> (ø) |
hnn_core/parallel_backends.py | 79.68% <60.00%> (-1.81%) :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I'm interested! You'll need to educate me a little over a video call on what the motivation is and what the pros/cons of spawning jobs through MPI vs. Python are, etc.
Let's set up a time to chat more about this!
I'd definitely like to be a part of the meeting as well. I'm wondering if this is also the reason why MPI doesn't work in WSL?
Propose a time! I'm okay with Wed/Thu afternoon and anytime Friday.
@rythorpe I added the comment that it closes #477
FYI, I think the reason `MPIBackend` is set up the way it is currently (i.e., prior to this PR) is specifically to provide timeouts + tracebacks when errors occur prior to a blocking MPI call. (See https://github.com/mpi4py/mpi4py/issues/53 for a discussion on this topic.) An `MPIBackend` subprocess created via `subprocess.Popen` enforces a timeout when waiting for communications to/from child MPI processes and terminates them when the timeout monitored by the parent process is exceeded, capturing `stderr` and `stdout` on the way out.
Correspondingly, unit tests that explicitly test the guts of `MPIBackend` should use the original codepath (i.e., calling `mpiexec` internally when `mpi_comm_spawn=False`) so that tracebacks can be preserved and logged. Tests for the new functionality introduced here will also be written; however, they should not replace the previous tests.
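Something along these lines, perhaps (a hypothetical pytest sketch; the `mpi_comm_spawn` keyword is taken from this discussion and the final API may differ):

```python
# Hypothetical test sketch exercising both codepaths via parametrization.
import pytest
from hnn_core import jones_2009_model, simulate_dipole
from hnn_core.parallel_backends import MPIBackend

@pytest.mark.parametrize('mpi_comm_spawn', [False, True])
def test_mpi_backend_codepaths(mpi_comm_spawn):
    net = jones_2009_model()
    # mpi_comm_spawn=False keeps the original mpiexec-via-Popen codepath
    # so timeouts and tracebacks are still exercised.
    with MPIBackend(n_procs=2, mpi_comm_spawn=mpi_comm_spawn):
        dpls = simulate_dipole(net, tstop=40., n_trials=1)
    assert len(dpls) == 1
```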
Seems like this might be a way to go: https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html using an `-m` flag? Try to avoid peppering the codebase with `try`/`except` blocks ... it's really an anti-pattern. But if that's the only way to get MPI to work gracefully, you can put it in one function ...
> Seems like this might be a way to go: https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html using an `-m` flag? Try to avoid peppering the codebase with `try`/`except` blocks ... it's really an anti-pattern. But if that's the only way to get MPI to work gracefully, you can put it in one function ...
Yes, this is what I've been doing when calling `mpiexec`; however, I haven't really explored using `try`/`except` statements. In order for this to be a robust and dependable solution for handling errors, we'd have to put `try`/`except` statements around every blocking MPI call in the codebase. This might work, though, since there aren't many MPI calls... I'll need to look into this more.
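To make the failure mode concrete, the pattern under discussion is roughly this (a sketch; `_run_chunk` is a hypothetical stand-in for the per-rank work):

```python
# If one rank raises before reaching a blocking collective, the remaining
# ranks hang waiting on it. Catching the exception and calling Abort tears
# down the whole job instead; `python -m mpi4py` installs a sys.excepthook
# that does essentially this, which is why the -m flag helps.
from mpi4py import MPI
import traceback

comm = MPI.COMM_WORLD
try:
    result = _run_chunk(comm.rank)         # hypothetical per-rank work
    results = comm.gather(result, root=0)  # blocking collective
except Exception:
    traceback.print_exc()
    comm.Abort(1)  # abort every rank so none is left blocking forever
```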
Minor update: I finally got this branch to run off of a master MPI process on Oscar (Brown's HPC). See here for a demo.