hnn-core icon indicating copy to clipboard operation
hnn-core copied to clipboard

CIs for MSMPI

Open jasmainak opened this issue 2 years ago • 8 comments

closes #589

It looks like we can inherit the github action from mpi4py :)

jasmainak avatar Jan 20 '23 04:01 jasmainak

Codecov Report

Merging #590 (ef536c5) into master (2e53827) will decrease coverage by 0.08%. The diff coverage is n/a.

:exclamation: Current head ef536c5 differs from pull request most recent head f41bba4. Consider uploading reports for the commit f41bba4 to get more accurate results

@@            Coverage Diff             @@
##           master     #590      +/-   ##
==========================================
- Coverage   92.00%   91.93%   -0.08%     
==========================================
  Files          22       22              
  Lines        4253     4253              
==========================================
- Hits         3913     3910       -3     
- Misses        340      343       +3     
Impacted Files Coverage Δ
hnn_core/parallel_backends.py 82.22% <0.00%> (-0.84%) :arrow_down:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov-commenter avatar Jan 20 '23 05:01 codecov-commenter

It's pretty hard for me to debug any further without a Windows machine. @rythorpe feel free to take over. But if you feel it's too involved, we can move on and work on it some time after the release as well.

jasmainak avatar Jan 20 '23 07:01 jasmainak

Ryan reminded me of this, I then ran it in my local fork and it works. So I guess it might be a cache problem? That's the only difference I can think of now. @jasmainak could you try to update the cache key to check?

chenghuzi avatar Jan 22 '23 21:01 chenghuzi

I'll try to get around to testing this out later this week.

rythorpe avatar Jan 23 '23 22:01 rythorpe

I've tried debugging this for a few hours on Windows 10 without much luck. I'm pretty sure this is some sort of unicode encoding/decoding error on the child process's side, however, it is extremely tricky to trace given that each mpi_child process is actually a child process within a child process (i.e., Popen starts a master subprocess that performs a system call to mpiexec, which in turn instantiates multiple parallel MPI processes). It's hard to know how and where a raw string pathname is passed to the system where backslashes get interpreted as escaped characters, and I suspect that one particular pathname (mpi_child.py) gets re-interpreted on multiple occasions throughout it's trajectory before finally getting called by the python interpreter.

On the bright side, I can confirm that MSMPI get's installed with NEURON on Windows and passes our parallel backend installation test (with removal of the backslashes of course).

rythorpe avatar Feb 09 '23 04:02 rythorpe

Any thoughts @jasmainak?

rythorpe avatar Feb 09 '23 17:02 rythorpe

Does this work or help: https://github.com/jonescompneurolab/hnn-core/pull/506 since there is one less nesting level to worry about?

The way I would debug this is by peppering the mpi_child.py code with dumb print (e.g., print('hi1'), print('hi2') etc) statements ... then you can figure out where exactly the failure occurs ...

If this is too much of a bother, I would say that for 0.3 release, we should prioritize:

  1. The GUI ... did you figure out the issue with the titles?
  2. Release Windows without MPI support for now since it is more for cluster set ups which will likely be running Linux

jasmainak avatar Feb 09 '23 20:02 jasmainak

Does this work or help: #506 since there is one less nesting level to worry about?

That's a good idea. I'll try it out.

The way I would debug this is by peppering the mpi_child.py code with dumb print (e.g., print('hi1'), print('hi2') etc) statements ... then you can figure out where exactly the failure occurs ...

That's pretty much what I've done, except that the error occurs before mpi_child.py gets called.

If this is too much of a bother, I would say that for 0.3 release, we should prioritize:

  1. The GUI ... did you figure out the issue with the titles?

You mean when running it on Windows? If so, it was just a weird dependency problem that I resolved by creating a new conda environment.

  1. Release Windows without MPI support for now since it is more for cluster set ups which will likely be running Linux

This doesn't make much sense to me since the vast majority of Windows users are new users or workshop participants who will need to run single-trial simulations in a reasonable amount of time (e.g., when running the tutorials synchronously).

rythorpe avatar Feb 10 '23 18:02 rythorpe