
[WIP] added option to run as native MPI (jonescompneurolab/hnn-core#998)

Open kenneth59715 opened this issue 1 year ago • 3 comments

Modifications to parallel_backends.py and mpi_child.py allow running as native MPI. It will still run as before if mpi_cmd is not ALREADYMPI.
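For context on what such a check could look like: the PR gates on an explicit mpi_cmd sentinel (ALREADYMPI), but one hypothetical way for a backend to detect an existing MPI launch automatically is to look for launcher-set environment variables. Everything below (the function name and the exact variable list) is illustrative, not hnn-core code:

```python
import os

def launched_under_mpi(env=None):
    """Return True if this process appears to have been started by an
    MPI launcher. The markers below are set by common launchers; the
    list is not exhaustive and names vary across MPI implementations."""
    env = os.environ if env is None else env
    markers = (
        "OMPI_COMM_WORLD_RANK",  # Open MPI mpiexec/mpirun
        "PMI_RANK",              # MPICH and derivatives
        "SLURM_PROCID",          # Slurm srun
    )
    return any(m in env for m in markers)
```

An explicit flag like the PR's --alreadympi is more robust than this kind of sniffing, since launcher environments differ; the sketch only illustrates the alternative.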

kenneth59715 avatar Feb 20 '25 20:02 kenneth59715

Thank you for your contribution Kenneth!

Please note that we are planning on merging a PR that will make changes to some of the same code lines that you have made changes to (#871). We will try to get these merged onto master tomorrow.

After we have merged those changes, would you mind rebasing your commit on the latest version of the master branch? Alternatively, if you would like, I can try rebasing your branch myself. Either way, this will make it significantly easier for us to review your code changes.

I will note that we take changes to the MPI API seriously, so this PR may end up taking a good bit of time and effort, particularly for testing.

Finally, can I ask, what OS are you using? Are you primarily planning on using this from HPC environments?

Thanks, Austin

asoplata avatar Feb 20 '25 21:02 asoplata

Hey Kenneth, we have merged a different PR's changes to the same files that you have made changes to. Please rebase your PR onto the current master branch. Alternatively, if you would like me to do it, I can do the rebasing; just let me know.

asoplata avatar Feb 20 '25 22:02 asoplata

Hi,

Maybe you could try rebasing, then if you run into any issues, let me know.

Kenneth


kenneth59715 avatar Feb 20 '25 23:02 kenneth59715

Apologies for the multiple force-pushes: the first one had all the code changes, but I realized it also changed the authorship, so I had to push a few more times to make it clear that you were the primary author of the commits.

I have rebased and rewritten the history of your nativempi branch. If you have any uncommitted or unpushed work, please make a backup before you pull from your origin remote! Once you've done that, if you pull then everyone should be on the same page.

This PR is going to require a lot of testing and a fair number of changes. The first order of business is: are you satisfied with the rebase? I.e., do the changes now reflect all the changes that you wanted to make?

asoplata avatar Feb 28 '25 16:02 asoplata

I will also ask: can you please provide a script illustrating this use-case? Especially, how do you set up a "native MPI" "host" thing (I don't know the right word for it) such that the new code is utilized? I have never used MPI without using mpiexec, and we also need this knowledge for documenting how to use this functionality (this will not be merged without documentation explaining how to use it). Thank you

asoplata avatar Feb 28 '25 16:02 asoplata

Austin,

The benchmark code would be invoked with mpiexec and the --alreadympi flag:

mpiexec -np 26 nrniv -python -mpi -nobanner 'benchmark_hnn_core_runtime.py' '--alreadympi'

I think I attached a modified benchmark_hnn_core_runtime.py to one of the emails.

Kenneth


kenneth59715 avatar Feb 28 '25 18:02 kenneth59715

Austin,

Sorry, I haven't used GitHub collaboratively. Is the new version of the nativempi branch in your repo? Should I pull that into my fork?

Kenneth


kenneth59715 avatar Feb 28 '25 18:02 kenneth59715

Austin, Sorry, I haven't used github collaboratively. Is the new version of the nativempi branch in your repo? Should I pull that into my fork? Kenneth

I will send you an email so we can separately discuss how to handle the git changes, I am happy to provide assistance with this.

Austin, The benchmark code would be invoked with mpiexec and the --alreadympi flag: mpiexec -np 26 nrniv -python -mpi -nobanner 'benchmark_hnn_core_runtime.py' '--alreadympi' I think I attached a modified benchmark_hnn_core_runtime.py to one of the emails. Kenneth

Okay, thanks, knowing how to properly call the script you gave via email is super helpful. I will try to test it on our HPC soon.

To be clear, part of the reason I was asking for an example use-case is that, for us to add a feature like this, the changes need to include documentation for both how and why you would want to use this feature (in addition to tests of the code changes). At minimum, we would want additions to our code-documentation website (i.e. https://jonescompneurolab.github.io/hnn-core/stable/index.html), such as on https://github.com/jonescompneurolab/hnn-core/blob/master/doc/parallel.md, and example code at https://github.com/jonescompneurolab/hnn-core/tree/master/examples . I'm happy to help convert the use-cases you provide into documentation for how to use the feature.

However, I don't yet fully grasp the motivation behind these changes; I have some, but limited experience with MPI in HPC environments. Would you mind expounding on the benefits of this feature? For example, is it faster than using child mpiexec calls (I'm sure the benchmark script is intended for that question), does it make it easier/possible to run HNN across multiple nodes simultaneously, etc.? Any description of the benefits you provide will be very helpful, and we will want to incorporate it into our documentation itself.

asoplata avatar Feb 28 '25 20:02 asoplata

Austin,

I'll try to write up something on Monday. In terms of justification, the reason this mode is useful on HPC clusters is that it's more portable between MPI environments. (It was about 10 seconds faster for a three-minute run, but that's not much of a difference.) One issue is the command used to start the processes in the backend: it could be mpiexec, but it might also be srun or mpirun, depending on how the environment is configured. Another example is the use of Singularity containers. For Neuroscience Gateway, we like to run apps in Singularity containers, so the parallel backend startup command would need to look something like mpiexec -n 26 singularity singularity.img nrniv ....

A lot of MPI code is written to be MPI environment-agnostic, so the code doesn't have to worry about environment-specific details like the invocation command or whether it's running in a Singularity container. For me, that style is easier to manage and work with.
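The launcher-agnostic style described here might be sketched as follows. The function name and the round-robin split are illustrative, not hnn-core API; the point is that the same script runs unchanged under mpiexec, mpirun, srun, a Singularity wrapper, or plain python:

```python
# How the processes were launched is invisible to this code: it only
# asks MPI for its rank and the world size, falling back to serial
# execution when mpi4py is not available.
try:
    from mpi4py import MPI
    RANK = MPI.COMM_WORLD.Get_rank()
    SIZE = MPI.COMM_WORLD.Get_size()
except ImportError:
    RANK, SIZE = 0, 1  # serial fallback

def my_trials(n_trials, rank=RANK, size=SIZE):
    """Round-robin assignment of trial indices to this rank."""
    return list(range(rank, n_trials, size))
```

Under a 26-rank launch, each rank would simulate only its own slice of the trials and the results would be gathered afterward; serially, one process gets them all.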

Kenneth


kenneth59715 avatar Feb 28 '25 21:02 kenneth59715

Hello Kenneth,

I spoke with my HNN colleagues yesterday, and in addition to the motivations you explained, they also informed me that the current way we spawn child processes may actually be disallowed in some HPC environments due to security policy. Indeed, this is part of the justification for an older, un-merged PR by @rythorpe here: https://github.com/jonescompneurolab/hnn-core/pull/506 (we may want to merge at least some of the code from that PR into this one). Enabling parallel HNN-core to run on a wider range of HPC environments, and become easier to use in custom MPI environments, is certainly enough motivation to work on merging this PR.

To be frank, I'm swamped right now and don't have enough attention-bandwidth to give this issue the focus it deserves; however, this will change in a month, after we finish preparing for and running our online HNN Workshop in early April. I should be able to dive into this after that.

A very incomplete list of the work left to do for this PR includes:

  • [ ] At minimum, any code changes need to pass a local run of make test on an installed development version of hnn-core. See our Contributing Page for how to set that up, and this section for running the tests.
    • Currently, the existing pytest tests pass, but this PR fails our ruff checks for style. Once you have installed the development version and packages for hnn-core, you can run ruff check hnn_core from the repository's top-level directory to see which errors are causing the problems. These need to be fixed, since they will cause our automated GitHub testing to fail as well.
  • [ ] The use of the term "alreadympi" should be replaced with something more descriptive. Similarly, we could use a good term for the non-alreadympi case, which will help us communicate about these two distinct code paths.
  • [ ] The commented-out uses of kill_proc_name('nrniv') should be re-enabled, BUT only in the non-alreadympi case. We still want to use that functionality in certain cases, or at least have a way to control it (maybe a no_kill_processes flag or something similar?).
  • [ ] All new functions like mpi_child.py::runit need to have comprehensive docstring documentation. The name should also be improved.
  • [ ] Tests! If we're going to support invoking hnn-core's MPI usage this way, then there need to be tests created inside our hnn_core/tests directory that explicitly run from inside such an external MPI invocation. I'm not sure how to do this, but maybe the easiest way is to start an MPI "host-thing" that executes some Python where the Python calls pytest directly, like https://docs.pytest.org/en/6.2.x/usage.html#calling-pytest-from-python-code .
  • [ ] Documentation: as I said in an earlier comment, once the API for this PR is settled, we need to add guidance to our documentation on how to use this functionality, including why you would want to use one type of MPI invocation vs our pre-existing mpiexec use.
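On the testing item, a minimal sketch of calling pytest from Python (per the linked pytest docs) that every rank of an external MPI launch could execute, e.g. via mpiexec -np 2 python run_mpi_tests.py. The function name and test path are assumptions, not existing hnn-core files:

```python
def run_suite(test_path="hnn_core/tests"):
    """Run pytest programmatically from an already-started MPI process.

    Each rank runs the same invocation; tests that need to coordinate
    across ranks would do so through MPI themselves. Returns pytest's
    exit code, which the caller should propagate (e.g. via sys.exit)
    so CI notices failures on any rank.
    """
    import pytest  # imported lazily so this sketch loads without pytest
    return pytest.main(["-x", test_path])
```

A wrapper script would call sys.exit(run_suite()) under an if __name__ == "__main__" guard; nonzero exit on any rank should make the MPI launcher report failure.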

asoplata avatar Mar 04 '25 16:03 asoplata

Austin,

There's no rush to get these in, from an NSG perspective. We hacked the parallel_backends.py file to use our MPI startup command-line options and made a special Tool on NSG to start the user script in serial. Also, I retired yesterday, so I'm not sure how much time I will be able to donate for cleanup.

One note on kill_proc: a cleaner way to kill off leftover processes on shared nodes is to use the process group ID to identify the processes started from a given process and kill only those. That should work for both cases.
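A minimal POSIX sketch of that process-group approach, with 'sleep' standing in for nrniv (this is an illustration of the idea, not hnn-core's current cleanup code):

```python
import os
import signal
import subprocess

# Start the child in its own session (and therefore its own process
# group), so cleanup can target exactly the processes this run spawned
# rather than every nrniv on a shared node.
proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
pgid = os.getpgid(proc.pid)

# Terminate everything in that group: the child plus anything it forked.
os.killpg(pgid, signal.SIGTERM)
proc.wait()
```

Unlike killing by name with kill_proc_name('nrniv'), this cannot touch processes belonging to other users or other simulations on the same node.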

Kenneth


kenneth59715 avatar Mar 05 '25 15:03 kenneth59715

Congratulations on the retirement Kenneth!!

asoplata avatar Mar 07 '25 14:03 asoplata