amuse icon indicating copy to clipboard operation
amuse copied to clipboard

phiGRAPE MPI support is broken

Open LourensVeen opened this issue 1 year ago • 3 comments

Describe the bug While converting phiGRAPE to the new build system, I tried removing the non-MPI worker and just putting in the MPI worker as the default, seeing as we're assuming that MPI is available everywhere anyway. This caused some of the tests to fail, in particular test11, which removes particles and then adds new particles representing merged pairs.

The original tests run only the non-MPI worker, except for for tests 15, 16 and 19, which are run with multiple workers but don't remove or add particles. So it looks like this was never tested. I haven't been able to find the exact cause, but it looks like adding and removing particles was simply never implemented correctly for the MPI-enabled case.

To Reproduce Modify TestPhigrape::test11 to run with two workers rather than one.

Expected behavior The code should give correct results rather than getting its internal state messed up, and thus pass the test.

Environment (please complete the following information):

  • OS and version: Kubuntu 22.04
  • Compiler: GFortran 11.4.0

Additional context I'll build the non-MPI worker as well for now in the new build system and disable the MPI one in interface.py to avoid giving people incorrect results.

Another thing to take away from this is that the tests should really be set up so that every test is run for every worker, rather than running most tests only on the default worker and then adding one or two for the other workers. The latter greatly reduces test coverage for the non-default workers, while not saving any work. (Except that you'd then have to fix all the additional bugs you find, but that's better than incorrect results.)

LourensVeen avatar Nov 04 '24 19:11 LourensVeen

@LourensVeen, I did what you suggested (after modifying the Makefile of phi grape to allow for multiple data formats), I modified test11 as you suggested (number_of_workers = 2) and performed the suite of tests. The outcome is this :

OSX 15.2 / gcc14 / python3.12.8 from Macports ====> (Amuse_15.2) cmb@FudiProM2(/opt/work/cmb/Amuse/Amuse_15.2){94} > mpiexec -n 1 --oversubscribe pytest --pest_phigrape [obas-rech-cmb.astro.unistra.fr:24569] pmix_mca_base_component_repository_open: unable to open /opt/locaress_zlib.so: File /opt/local/lib/openmpi-mp/pmix/pmix_mca_pcompress_zlib.so.sl not found (ignored) ....
============================================= test session starts ======================== platform darwin -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 rootdir: /System/Volumes/Data/Work/cmb/Amuse/Amuse_15.2 plugins: anyio-4.8.0 collected 26 items

[obas-rech-cmb.astro.unistra.fr:24593] pmix_mca_base_component_repository_open: unable to open /opt/locaress_zlib.so: File /opt/local/lib/openmpi-mp/pmix/pmix_mca_pcompress_zlib.so.sl not found (ignored) ... [obas-rech-cmb.astro.unistra.fr:24594] pmix_mca_base_component_repository_open: unable to open /opt/localurm.so: File /opt/local/lib/openmpi-mp/pmix/pmix_mca_prm_slurm.so.sl not found (ignored) . ........... Set stopping condition 0 Set stopping condition 0 Set stopping condition 1 Set stopping condition 0 ...............

===================================== warnings summary ========================= ../../../../../../../opt/work/cmb/Amuse/Amuse_15.2/amuse-source/src/amuse/support/options.py:11 /opt/work/cmb/Amuse/Amuse_15.2/amuse-source/src/amuse/support/options.py:11: DeprecationWarning: pkg_re https://setuptools.pypa.io/en/latest/pkg_resources.html import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================================= 26 passed, 1 warning in 2.67s ===================

So apart from addressing new file on the system (new OSX security issues ??) I don't have any test failure.

christianboily avatar Jan 15 '25 14:01 christianboily

Hm, interesting. It's been a while since I looked at this, and it may be that I misunderstood something. I'll have another go at it after the new release, see if I can either get it to work or get a better understanding of what the problem is. Thanks for testing!

LourensVeen avatar Jan 16 '25 09:01 LourensVeen

I ran into this again while testing the new installer more, and I think I've found a problem: the way I was injecting values into the phigrape Makefile didn't pass in the right flags, and that caused MPI and non-MPI code to get mixed up.

The behaviour of make changes a bit depending on whether variables were set in the environment or not, and I misunderstood how overriding works (make -C src VAR="something" turns out to be a lot more aggressive than I thought...).

So now I can run test11 with 2 workers, but on the other hand, if I set the MPI worker as the default it still fails some of the tests:

FAILED phigrape/tests/test_phigrape.py::TestMPIInterface::test7 - AssertionError: nan != 0.5
FAILED phigrape/tests/test_phigrape.py::TestPhigrape::test10 - AssertionError: False is not true
FAILED phigrape/tests/test_phigrape.py::TestPhigrape::test18 - AssertionError: 38.0 mass != 23 mass within 7 places
FAILED phigrape/tests/test_phigrape.py::TestPhigrape::test4 - AssertionError: 15.0 kg != 17.0 kg

Could you try any of these tests with more than one worker?

LourensVeen avatar Jan 23 '25 21:01 LourensVeen