dicodile
Test mpi versions
Runs the tests:
- on ubuntu-18.04 and ubuntu-20.04
- using the mpich and openmpi implementations
- with both system and conda installations of each MPI implementation
Tests with openmpi on ubuntu-18.04 fail due to #12.
Tests with mpich on both ubuntu-18.04 and ubuntu-20.04 fail due to #19.
Codecov Report
Merging #20 (75a5004) into main (0aad2ea) will not change coverage. The diff coverage is n/a.
:exclamation: Current head 75a5004 differs from pull request most recent head 909cdcf. Consider uploading reports for the commit 909cdcf to get more accurate results
@@ Coverage Diff @@
## main #20 +/- ##
=======================================
Coverage 74.29% 74.29%
=======================================
Files 41 41
Lines 2587 2587
=======================================
Hits 1922 1922
Misses 665 665
Flag | Coverage Δ | |
---|---|---|
unittests | 74.29% <ø> (ø) |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I am not sure why it is still in fail-fast mode. ~~Did you rebase on master?~~ I saw that you did, so I am not sure why the tests are stopped.
It would be nice to have all these tests, potentially with xfail on the configurations known to cause problems.
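To make the xfail idea concrete, here is a minimal sketch of how the known-bad configurations from the PR description could be looked up. It is only an illustration: `KNOWN_FAILURES`, `known_failure_reason`, `current_failure_reason`, and the `MPI_IMPL` / `OS_IMAGE` environment variables are hypothetical names that the CI workflow would have to provide; none of them exist in dicodile.

```python
import os

# Known-bad configurations, taken from the PR description:
# (MPI implementation, OS image) -> issue reference.
KNOWN_FAILURES = {
    ("openmpi", "ubuntu-18.04"): "#12",
    ("mpich", "ubuntu-18.04"): "#19",
    ("mpich", "ubuntu-20.04"): "#19",
}


def known_failure_reason(mpi_impl, os_image):
    """Return a human-readable reason if this configuration is known to fail, else None."""
    issue = KNOWN_FAILURES.get((mpi_impl.lower(), os_image.lower()))
    if issue is None:
        return None
    return f"{mpi_impl} on {os_image} is known to fail, see {issue}"


def current_failure_reason(environ=os.environ):
    """Look up the running configuration.

    MPI_IMPL and OS_IMAGE are hypothetical variables that the CI job
    would need to export; they are an assumption of this sketch.
    """
    return known_failure_reason(
        environ.get("MPI_IMPL", ""), environ.get("OS_IMAGE", "")
    )
```

The result could then feed a module-level marker such as `pytestmark = pytest.mark.xfail(reason is not None, reason=reason or "")`. For configurations that hang rather than fail, `run=False` would probably be needed as well, so that the test is not executed at all.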
They are stopped because of a timeout.
The tests with mpich hang at some point due to #19, and the job then waits until it reaches the maximum timeout for GitHub Actions. Once the timeout expires, the tests are cancelled.
@tomMoral The main problem for the tests with mpich is that we need to run them with `mpiexec -np 1 pytest ..` due to https://github.com/pmodels/mpich/issues/4853.
But when the tests are run with mpiexec (for both openmpi and mpich), there is a problem with stopping the spawned processes: I do not know how to release the resources started by MPI for the tests. (The problem appears only when running the tests.) Do you have any idea?
The problem seems to be in the initialization of MPI, with an issue with an argument, no?
It seems that the process hangs just before calling `dicodile/tests/test_dicodile.py::test_dicodile`.
I think one of the issues is that `from mpi4py import MPI` only returns when `MPI_Init` completes. This call is triggered by the import, so it is hard to think of a way to detect the failure if the call itself never returns.
One way to detect this would be to wrap the import with `faulthandler.dump_traceback_later(timeout=120, exit=True)` and `faulthandler.cancel_dump_traceback_later()`, which would exit if the import hangs for more than 2 minutes, with info that might help with debugging.
WDYT?
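A minimal sketch of that watchdog, assuming it is acceptable for the whole process to be terminated when the import hangs. `import_with_watchdog` is a hypothetical helper name, not something that exists in dicodile or mpi4py; only the standard-library `faulthandler` and `importlib` calls are real.

```python
import faulthandler
import importlib
import sys


def import_with_watchdog(module_name, timeout=120):
    """Import a module, dumping tracebacks and exiting if the import hangs.

    If the import is still running after `timeout` seconds, faulthandler
    writes the traceback of every thread to stderr and, because of
    exit=True, terminates the process.
    """
    # sys.__stderr__ is used so the dump still has a real file descriptor
    # when sys.stderr has been replaced (for example by a test runner).
    faulthandler.dump_traceback_later(timeout, exit=True, file=sys.__stderr__)
    try:
        return importlib.import_module(module_name)
    finally:
        # Disarm the watchdog as soon as the import returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

It would be used as `MPI = import_with_watchdog("mpi4py.MPI")` in place of `from mpi4py import MPI`. Note that without `exit=True` the traceback is only dumped and the process keeps hanging.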
@tomMoral As far as I understand, this message is due to the Singleton feature not being implemented in mpich; see the mpich issue on GitHub. Details are explained in #19.
I think with mpich we need to run the tests with:
mpirun -np 1 --host localhost:16 pytest
Note: we can actually use the same command for both mpich and openmpi. As the hostfile formats for mpich and openmpi are not the same, `--host localhost:16` avoids having to set a hostfile.
When we use the above command with:
- openmpi: all tests pass, but it cannot stop the processes spawned by the last test, so it hangs.
- mpich: some test_dicodile tests pass, but like openmpi it cannot stop the spawned processes. test_dicod has another problem.
I think the openmpi version should be able to stop spawned processes properly. That makes me think that the code to stop spawned processes might not be reliable.
@tomMoral I tried using mpich with a very simple MPI program that spawns a number of processes (it gets the hostfile from the environment) to see whether the problem arises from the dicodile code.
With openmpi I can run the program as:
python prog.py
If I do the same with mpich, I get the above error, i.e. unrecognized argument pmi_args. I need to run it as:
mpirun -np 1 python prog.py
I think this is really due to Singleton not being implemented in mpich.
I propose to change the testing command to
mpirun -np 1 --host localhost:16 python -m pytest
and fix the hanging problem and other possible problems afterwards.
WDYT?
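Until the hanging problem is fixed, the proposed command could be run under a hard time limit so that a hang fails quickly instead of waiting for the GitHub Actions maximum timeout. The following is only a sketch under assumptions: `run_with_timeout` and `TEST_CMD` are hypothetical names, the approach is POSIX-only, and it uses `sys.executable` in place of `python`. It kills the process group of the launcher, which may not cover MPI processes started in another session.

```python
import os
import signal
import subprocess
import sys

# The test command proposed above, with sys.executable standing in for `python`.
TEST_CMD = ["mpirun", "-np", "1", "--host", "localhost:16",
            sys.executable, "-m", "pytest"]


def run_with_timeout(cmd, timeout):
    """Run cmd; return its exit code, or None if it was killed after `timeout` seconds."""
    # start_new_session=True puts the child in its own process group,
    # whose id equals the child's pid, so the whole group can be signalled.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # Kill the launcher and every process still in its group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            # The group exited between the timeout and the kill.
            pass
        proc.wait()
        return None
```

A CI step could then call `run_with_timeout(TEST_CMD, timeout=600)` and treat `None` as a failure; the 600-second limit is an arbitrary example value.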