rci_mpi not working with openmpi 4.0.0 on Mac systems
Dear all,
I encountered some issues with the mpi programs of GRASP2018.
I have recently upgraded openmpi to 4.0.0 on Michel's MAC computer (macOS 10.14.3)
Everything was working fine before the upgrade, but now I am unable to run rci_mpi correctly.
The program starts correctly but gets stuck right after/during computing the Breit integrals.
Here is the output:
...
Block 1 ncf = 2386 id = 1/2+
1
Computing 1093303 Rk integrals
Computing 10433675 Breit integrals of type 1
I already asked Jon about this because I knew he is also using a mac computer. He confirmed that GCC 8.2.0 + OPENMPI 3.1.3 works fine, while GCC 8.3.0 + OPENMPI 4.0.0 does not.
Please let me know if you have any idea of how to solve this. I attached the files needed to reproduce the problem (test_mpi.zip: https://github.com/compas/grasp/files/2966880/test_mpi.zip).
Sacha
Dear Sasha,
I recently had problems compiling with an upgraded OpenMPI. Everything was fine if I used the old-fashioned INCLUDE 'mpif.h'. The recent versions of OpenMPI use structured datatypes and are based more on concepts from Fortran 2008, whereas GRASP2018 is more like a Fortran 95 program. It is easier to upgrade hardware than software! You should read the OpenMPI documentation to understand the changes that have been made.
In your case, the code compiled but had problems at run-time. Why upgrade? If not, someone needs to find out what MPI call went wrong.
Regards, Charlotte.
@cffischer In response to "Why upgrade?": we'd better keep the codes running with the latest version of OpenMPI, right? Seems to me like an example of fundamental maintenance we just have to deal with. Everyone with an up-to-date system who compiles and runs e.g. rci_mpi today will have this problem. Both macOS and GNU/Linux systems will, more or less, automatically upgrade to OpenMPI 4 now, since it's available in the stable releases. Also, it's a pain to revert and keep multiple versions of libs for various codes. Otherwise, we have to go back to shipping MPI libs with the GRASP source code again, I guess. Just some thoughts :)
Great that @SachaSchiffmann found this anyway.
No idea if the problem also occurs on Linux systems; I'd have to check, but I don't dare upgrade my main server. Anyone already on 4.0.0 who can run through the attached case above?
@WenxianLi confirms that OpenMPI 4.0.0 works on Linux (Ubuntu I guess).
GRASP is an F95 program and as such uses certain libraries. Compilers for "modern Fortran" concepts are backward compatible, but libraries built around the modern concepts may not be. The latest OpenMPI releases are based on MPI datatypes and structures, whereas earlier ones used plain variables. It is necessary for our codes to have the appropriate libraries. Our code does not use fancy features -- why do we need the latest OpenMPI libraries? They will not make our applications more efficient.
At a modern High-Performance Computing Center, there are many compilers and libraries available to the user. At Malmö, only gfortran 4.8 (or something like that) was available, so that is what we targeted. We made sure the code would run on their platform.
If we want to use the latest OpenMPI, then someone needs to learn about the new versions and develop our codes accordingly.
Keeping the code compatible with newer versions of dependent libraries is a good thing, if possible. I assume that we're currently officially targeting OpenMPI 2, but e.g. Ubuntu 18.10 shipped with OpenMPI 3.1, and I wouldn't be surprised if 19.10 or 20.04 goes with 4.x.
What would be helpful is to identify where in the code the stalling occurs, figure out the problem and see if it is something we can fix without breaking OpenMPI 2 & 3 support.
Hello,
Thank you all for your answers. I asked Wenxian to test this in Malmö as well. She told me that it was working fine on the Malmö cluster with openmpi 4.0.0 but was failing on her personal Mac laptop.
To answer Charlotte's question about upgrading: I did it because I upgraded the Mac operating system from a very old one to the newest one, and I thought it would be good to have the newest versions of gfortran and openmpi. Even though I also think it is not a bad idea to have the codes compatible with openmpi 4.0.0, I will reinstall an older version.
I also identified where it gets stuck. It is in the subroutine genintbreit1wrap, if I am not mistaken. Here is the code:
CALL genintbreit1 ((myid), (nprocs), N, j2max)
! Gather integrals (and their indeces) from- and send to- all nodes
CALL gisummpi (INDTP1, N)
CALL gdsummpi (VALTP1, N)
It never returns from the CALL to gisummpi.
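For context: gisummpi and gdsummpi appear to be thin wrappers that sum an array over all ranks. A minimal sketch of what such a wrapper might look like -- the routine name and internals here are assumptions, not the actual GRASP source -- is:

SUBROUTINE gisummpi_sketch (ibuf, n)
   ! Hedged sketch: element-wise sum of an integer array across all
   ! ranks, with the result left in place on every rank.
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER, INTENT(IN)    :: n
   INTEGER, INTENT(INOUT) :: ibuf(n)
   INTEGER :: ierr
   ! Every rank blocks here until all ranks have entered the collective.
   CALL MPI_ALLREDUCE (MPI_IN_PLACE, ibuf, n, MPI_INTEGER, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
END SUBROUTINE gisummpi_sketch

If that is roughly what gisummpi does, a hang there usually means that one rank never reaches the collective, or that the ranks disagree on the value of N.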
Sasha,
It occurred to me later that there are two issues: which OpenMPI version is installed, and which MPI interface is used. I suspect the latter comes down to "USE mpi" versus the older INCLUDE 'mpif.h', which is part of cpath.f90, I believe. Unfortunately, this is not an environment variable. When I upgraded to OpenMPI 3 and gfortran-7 I had problems that were solved by using 'mpif.h'. I suspect this is the 'backward compatibility' I needed on my Linux system.
So, before you uninstall OpenMPI 4, try changing the "USE mpi" that Jacek put into the code, though I thought the default should be the INCLUDE 'mpif.h'.
Let me know what happens.
Charlotte
Dear Charlotte,
I have tried 'USE mpi', but then the MPI library no longer compiles.
mpifort -c -O2 -fno-automatic lodcslmpi.f90 -I ../libmod -I ../lib9290 -I . -o lodcslmpi.o
lodcslmpi.f90:53:71:
(IQA(:,:), NNNW*NCF,MPIX_INT1,0,MPI_COMM_WORLD,ierr)
1
Error: There is no specific subroutine for the generic 'mpi_bcast' at (1)
I am still trying to understand why it does not compile.
Sacha
Dear Sasha,
Try two things.
1. In /src/lib/mpi90/mpi_C.f90, edit the routine to use INCLUDE 'mpif.h' rather than USE mpi. Note that the former statement goes AFTER the IMPLICIT statement, whereas the latter goes BEFORE (see the sketch after this list). Jacek introduced USE mpi, whereas I always thought the INCLUDE would be more fail-safe for users of our code. Does this work with the latest OpenMPI 4?
2. Assuming you want the latest OpenMPI and USE mpi (or USE mpi_f08), then we may need to change some MPI calls. You have found a problem in MPI_Bcast.
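To illustrate the placement point in item 1, here is a minimal sketch of the two variants (the routine names are hypothetical):

! Variant A: the module form must come BEFORE the IMPLICIT statement.
SUBROUTINE with_use_mpi
   USE mpi
   IMPLICIT NONE
   INTEGER :: ierr
   CALL MPI_BARRIER (MPI_COMM_WORLD, ierr)
END SUBROUTINE with_use_mpi

! Variant B: the legacy header must be included AFTER the IMPLICIT statement.
SUBROUTINE with_mpif_h
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER :: ierr
   CALL MPI_BARRIER (MPI_COMM_WORLD, ierr)
END SUBROUTINE with_mpif_h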
Just as the Fortran language has evolved, so has MPI. A big change has been in how arguments are passed from one procedure to the next. In F77, a CALL sub(x) would pass on the "address of x" to the subroutine: "call by reference". If x happened to be a two-dimensional array, it was the address of x(1,1) that was passed, nothing else. In fact, "sub" knew nothing about what was being passed -- all it knew was an address, and it could do whatever it wanted with that address. In F90 things are very different -- an address is passed along with all kinds of information about x, such as its rank and dimensions, and if it is a data structure, other information as well. OpenMPI now uses data structures and different datatypes. So we need to go back to the syntax of "modern" MPI.
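A small illustration of that difference (the routine names are hypothetical):

! F77 style: the dummy argument has an explicit size, and effectively
! only the address of x(1,1) is passed to the subroutine.
SUBROUTINE old_style (x, n, m)
   IMPLICIT NONE
   INTEGER :: n, m
   DOUBLE PRECISION :: x(n,m)
END SUBROUTINE old_style

! F90 style: an assumed-shape dummy requires an explicit interface,
! and the compiler passes a descriptor carrying rank, bounds and strides.
SUBROUTINE new_style (x)
   IMPLICIT NONE
   DOUBLE PRECISION :: x(:,:)
END SUBROUTINE new_style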
It is difficult to find clear documentation on the web, but here is one I have found useful
http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
that is also on GitHub somewhere. If you scroll down, you will see the statement syntax:
MPI_Bcast(
void* data,
int count,
MPI_Datatype datatype,
int root,
MPI_Comm communicator)
In our case, the communicator is MPI_COMM_WORLD. We never use the ierr, so maybe we should just remove it and see what happens.
Unfortunately, there may be a difference between Fortran and C, in which case my argument might not be valid. Let me know what happens.
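For comparison with the C signature above, a minimal self-contained Fortran example of the same call, in the old mpif.h style (the program and buffer names are made up), would be:

PROGRAM bcast_demo
   ! Hedged sketch: broadcast an integer array from rank 0 to all ranks.
   ! In the Fortran binding, IERROR is an extra trailing argument.
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER :: ibuf(4), myid, ierr
   CALL MPI_INIT (ierr)
   CALL MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
   IF (myid == 0) ibuf = (/ 1, 2, 3, 4 /)
   CALL MPI_BCAST (ibuf, 4, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
   CALL MPI_FINALIZE (ierr)
END PROGRAM bcast_demo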
Does [INCLUDE 'mpif.h'] work with the latest OpenMPI 4?
According to the docs, mpif.h is still supported in OpenMPI 4.
FWIW, I can also confirm that the issue does not seem to be present on Linux (Ubuntu 18.04, GCC 7.3.0, OpenMPI 4.0.0 compiled by hand). Also, as far as I can tell, the MPI_Bcast interface has not changed between 2.x and 4.x.
@SachaSchiffmann Out of curiosity, how did you install OpenMPI on macOS? Via Homebrew?
Dear all,
Maybe I wasn't clear about what I had already tried.
- I was originally using INCLUDE 'mpif.h'. The codes compiled fine, but the program gets stuck at execution.
- Then I tried 'USE mpi' instead, as you suggested. The codes no longer compiled, with the following error message:
Error: There is no specific subroutine for the generic 'mpi_bcast' at (1)
Charlotte, I have tried to eliminate the ierr parameter, but according to the documentation (the MPI_BCAST documentation):
Fortran Syntax
USE MPI
! or the older form: INCLUDE 'mpif.h'
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
<type> BUFFER(*)
INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
there is no problem with keeping it. The error does not go away...
@mortenpi I indeed used Homebrew to install openmpi and gcc.
Have you found good documentation of the Fortran version of MPI? The ones I like are for C, which is not quite the same as Fortran in certain details.
In lodcslmpi we have defined MPIX_INT1 to be of type "Byte". I think what the compiler is saying is that the generic MPI_BCAST has no specific version for this type of data.
So look at the comments in the code. Jacek has added comments to the effect that we should try using the type mpi_integer1. Somewhere in our different "kinds" we have defined the meaning of "bytes"; we need to match this with the MPI kinds. Somewhere there is documentation on the various kinds recognized by MPI.
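As a concrete, but unverified, illustration of Jacek's suggestion, the idea would be to broadcast the 1-byte integer array with the standard MPI_INTEGER1 datatype. A minimal self-contained sketch (with hypothetical names throughout, and using the mpif.h form so the interface is not checked) is:

PROGRAM int1_bcast_sketch
   ! Hedged sketch: broadcast a 1-byte integer array using the standard
   ! MPI_INTEGER1 datatype instead of a locally defined constant such
   ! as MPIX_INT1.  None of these names come from the actual GRASP code.
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER, PARAMETER :: BYTE = SELECTED_INT_KIND(2)   ! typically a 1-byte kind
   INTEGER(KIND=BYTE) :: iqa_like(5, 8)
   INTEGER :: myid, ierr
   CALL MPI_INIT (ierr)
   CALL MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
   IF (myid == 0) iqa_like = 1_BYTE
   CALL MPI_BCAST (iqa_like, 5*8, MPI_INTEGER1, 0, MPI_COMM_WORLD, ierr)
   CALL MPI_FINALIZE (ierr)
END PROGRAM int1_bcast_sketch

Whether the generic MPI_BCAST in the USE mpi module accepts a 1-byte, rank-2 buffer is exactly what would need to be checked against the OpenMPI 4 build.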
Maybe you should discuss this with Jacek.
I indeed used Homebrew to install openmpi and gcc.
I would see what happens if you compile OpenMPI by hand and link against that. That would rule out problems with the Homebrew install scripts.
Hello,
I kept looking, but I still don't know why I am unable to get the programs running correctly with the latest version of openmpi. I have tried to install it by hand, but it crashes during compilation. Therefore I reinstalled openmpi 3.1.3 and everything is working fine again. I will keep investigating openmpi 4.0.0, and if I cannot find a solution within a few days, I will send an email to Jacek.
@jongrumer if you have some time to test this as well, it would be great.
@SachaSchiffmann Unfortunately I don't have time at the moment. Maybe later, but I'm going to Max Planck to work on a kilonova code for 2 weeks, and then to the Lab Astro conference at Cambridge. Hope that things have calmed down around mid-April.
@SachaSchiffmann What's the status on this issue? Any progress?
@jongrumer No progress so far. I had to revert to an older version.