
rci_mpi not working with openmpi 4.0.0 on Mac systems

Open SachaSchiffmann opened this issue 6 years ago • 16 comments

Dear all,

I encountered some issues with the MPI programs of GRASP2018. I recently upgraded OpenMPI to 4.0.0 on Michel's Mac computer (macOS 10.14.3). Everything was working fine before the upgrade, but now I am unable to run rci_mpi correctly. The program starts correctly but gets stuck right after/during the computation of the Breit integrals.

Here is the output:

...
Block            1    ncf =         2386  id =  1/2+
1
Computing     1093303  Rk integrals
Computing    10433675  Breit integrals of type 1

I already asked Jon about this because I knew he also uses a Mac. He confirmed that GCC 8.2.0 + OpenMPI 3.1.3 works fine, while GCC 8.3.0 + OpenMPI 4.0.0 does not.

Please let me know if you have any idea on how to solve this. I attached the files needed to reproduce the problem.

Sacha

test_mpi.zip

SachaSchiffmann avatar Mar 14 '19 14:03 SachaSchiffmann

Dear Sasha,

I recently had problems compiling with an upgraded OpenMPI. Everything was fine if I used the old-fashioned INCLUDE 'mpif.h'. The recent versions of OpenMPI use structured datatypes and are based more on concepts from Fortran 2008, whereas GRASP2018 is essentially a Fortran 95 program. It is easier to upgrade hardware than software! You should read the OpenMPI documentation to understand the changes that have been made.
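To make the contrast concrete, here is a sketch (not lines from GRASP) of the same broadcast in both styles; with USE mpi the handles are plain integers, whereas the Fortran 2008 module mpi_f08 wraps them in derived types and makes the error argument optional:

! Sketch only, not taken from the GRASP sources.

! F95 style: plain integer handles; works with USE mpi or INCLUDE 'mpif.h'
SUBROUTINE bcast_f95 (buf, n)
   USE mpi
   IMPLICIT NONE
   INTEGER, INTENT(IN)    :: n
   INTEGER, INTENT(INOUT) :: buf(n)
   INTEGER :: ierr
   CALL MPI_BCAST (buf, n, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
END SUBROUTINE bcast_f95

! Fortran 2008 style: MPI_COMM_WORLD and MPI_INTEGER are derived-type handles
! (type(MPI_Comm), type(MPI_Datatype)) and the ierror argument is optional.
SUBROUTINE bcast_f08 (buf, n)
   USE mpi_f08
   IMPLICIT NONE
   INTEGER, INTENT(IN)    :: n
   INTEGER, INTENT(INOUT) :: buf(n)
   CALL MPI_Bcast (buf, n, MPI_INTEGER, 0, MPI_COMM_WORLD)
END SUBROUTINE bcast_f08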

In your case, the code compiled but had problems at run-time. Why upgrade? If not, someone needs to find out what MPI call went wrong.

Regards, Charlotte.



cffischer avatar Mar 14 '19 16:03 cffischer

@cffischer In response to "Why upgrade?": we'd better keep the codes running with the latest version of OpenMPI, right? Seems to me like an example of fundamental maintenance we just have to deal with. Everyone with an up-to-date system who compiles and runs e.g. rci_mpi today will hit this problem. Both macOS and GNU/Linux systems will, more or less, automatically upgrade to OpenMPI 4 now, since it's available in the stable releases. Also, it's a pain to revert and keep multiple versions of libs for various codes. Or we have to go back to shipping MPI libs with the GRASP source code again, I guess. Just some thoughts :)

Great that @SachaSchiffmann found this anyway.

No idea if the problem also occurs on Linux systems, have to check but don't dare to upgrade my main server. Anyone already on 4.0.0 who can run through the attached case above?

@WenxianLi confirms that OpenMPI 4.0.0 works on Linux (Ubuntu I guess).

jongrumer avatar Mar 14 '19 22:03 jongrumer

Grasp is an F95 program and as such uses certain libraries. Compilers for "modern Fortran" concepts are backward compatible but libraries with the modern concepts may not be compatible. The latest OpenMPI codes are based on MPI DataTypes or structures whereas earlier ones have variables. It is necessary for our codes to have the appropriate libraries. Our code does not use fancy features -- why do we need the latest OpenMPI libraries? They will not make our applications more efficient.

At a modern High-Performance Computing Center, there are many compilers and libraries available to the user. At Malmö, only gfortran 4.8 (or something like that) was available, and so that is what we targeted. We made sure the code would run on their platform.

If we want to use the latest OpenMPI, then someone needs to learn about the new versions and develop our codes accordingly.

cffischer avatar Mar 14 '19 23:03 cffischer

Keeping the code compatible with newer versions of dependent libraries is a good thing, if possible. I assume that we're currently officially targeting OpenMPI 2, but e.g. Ubuntu 18.10 shipped with OpenMPI 3.1, and I wouldn't be surprised if 19.10 or 20.04 goes with 4.x.

What would be helpful is to identify where in the code the stalling occurs, figure out the problem and see if it is something we can fix without breaking OpenMPI 2 & 3 support.

mortenpi avatar Mar 15 '19 01:03 mortenpi

Hello,

Thank you all for your answers. I asked Wenxian to test this in Malmö as well. She told me that it was working fine on the Malmö cluster with OpenMPI 4.0.0 but was failing on her personal Mac laptop.

To answer Charlotte's question about upgrading: I did it because I upgraded the Mac operating system from a very old one to the newest one, and I thought it would be good to have the newest versions of gfortran and OpenMPI. Even though I also think it is not a bad idea to make the codes compatible with OpenMPI 4.0.0, I will reinstall an older version.

I have also identified where it gets stuck. It is in the subroutine genintbreit1wrap, if I am not mistaken. Here is the code:

CALL genintbreit1 ((myid), (nprocs), N, j2max)

! Gather integrals (and their indeces) from- and send to- all nodes

CALL gisummpi (INDTP1, N)
CALL gdsummpi (VALTP1, N)

It never returns from the CALL to gisummpi.
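For context, gisummpi is a global-sum wrapper: every rank contributes its local integrals and receives the sum over all ranks. A minimal sketch, assuming it wraps an in-place MPI_ALLREDUCE (the actual GRASP routine may differ in detail), would be:

! Hypothetical sketch in the spirit of gisummpi, not the actual GRASP source.
SUBROUTINE gisum_sketch (x, n)
   USE mpi
   IMPLICIT NONE
   INTEGER, INTENT(IN)    :: n
   INTEGER, INTENT(INOUT) :: x(n)
   INTEGER :: ierr
   ! MPI_IN_PLACE: the input array doubles as the receive buffer on every rank
   CALL MPI_ALLREDUCE (MPI_IN_PLACE, x, n, MPI_INTEGER, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)
END SUBROUTINE gisum_sketch

A hang inside such a collective usually means either that the ranks disagree on the call sequence or counts, or that the MPI transport itself misbehaves, which would explain why the same code runs with one OpenMPI build and stalls with another.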

SachaSchiffmann avatar Mar 15 '19 08:03 SachaSchiffmann

Sasha,

It occurred to me later that there is the issue of which OpenMPI version is installed, and then which MPI interface is used. I suspect the latter is associated with "USE mpi" versus the older INCLUDE 'mpif.h' that is part of cpath.f90, I believe. Unfortunately, this is not an environment variable. When I upgraded to OpenMPI 3 and gfortran-7, I had problems that were solved by using 'mpif.h'. I suspect this is the 'backward compatibility' I needed on my Linux system.

So, before you uninstall OpenMPI 4, try changing the "USE mpi" that Jacek put into the code, though I thought the default should be the INCLUDE 'mpif.h'.
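In other words, the two placements look roughly like this (module and variable names here are illustrative, not the actual GRASP file):

! Variant with the module: USE comes before IMPLICIT NONE
MODULE mpi_vars
   USE mpi
   IMPLICIT NONE
   INTEGER :: myid, nprocs, ierr   ! illustrative declarations only
END MODULE mpi_vars

! Variant with the header file: INCLUDE comes after IMPLICIT NONE
! (use one variant or the other, not both)
MODULE mpi_vars
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER :: myid, nprocs, ierr   ! illustrative declarations only
END MODULE mpi_vars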

Let me know what happens.

Charlotte


cffischer avatar Mar 15 '19 15:03 cffischer

Dear Charlotte,

I have tried 'use mpi' but then the mpi library does not compile anymore.

mpifort -c -O2 -fno-automatic  lodcslmpi.f90  -I ../libmod -I ../lib9290 -I . -o lodcslmpi.o
lodcslmpi.f90:53:71:

                (IQA(:,:),     NNNW*NCF,MPIX_INT1,0,MPI_COMM_WORLD,ierr)
                                                                       1
Error: There is no specific subroutine for the generic 'mpi_bcast' at (1)

I am still trying to understand why it does not compile.

Sacha

SachaSchiffmann avatar Mar 18 '19 16:03 SachaSchiffmann

Dear Sasha,

Try two things.

  1. In /src/lib/mpi90/mpi_C.f90, edit the routine to use INCLUDE 'mpif.h' rather than USE mpi. Note that the former statement goes AFTER the IMPLICIT statement, whereas the latter goes BEFORE it. Jacek introduced USE mpi, whereas I always thought the INCLUDE would be more fail-safe for users of our code. Does this work with the latest OpenMPI 4?

  2. Assuming you want the latest OpenMPI and USE mpi (or USE mpi_f08), then we may need to change some MPI calls. You have found a problem in MPI_Bcast.
    Just as the Fortran language has evolved, so has MPI. A big change has been in how arguments are passed from one procedure to the next. In F77, a CALL sub(x) would pass the "address of x" to the subroutine: "call by reference". If x happened to be a two-dimensional array, it was the address of x(1,1) that was passed, nothing else. In fact, "sub" knew nothing about what was being passed -- all it knew was an address. It could do whatever it wanted with that address. In F90 things are very different -- an address is passed along with all kinds of information about x, such as its rank and dimensions and, if it is a data structure, other information. OpenMPI now uses data structures and different datatypes. So we need to go back to the syntax of "modern" MPI (a small sketch follows at the end of this message).

It is difficult to find clear documentation on the web, but here is one page I have found useful, http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/ , which is also on GitHub somewhere. If you scroll down you see the statement syntax MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root, MPI_Comm communicator). In our case, the communicator is MPI_COMM_WORLD. We never use the ierr, so maybe we should just remove it and see what happens.
Unfortunately, there may be a difference between Fortran and C, in which case my argument might not be valid. Let me know what happens.
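As a small illustration of the argument-passing point in item 2 (a sketch only, not GRASP code): an F77-style assumed-size dummy receives nothing but a starting address, while an F90 assumed-shape dummy receives a descriptor carrying rank and extents, which requires an explicit interface at the call site.

! F77 style: the callee sees only an address; the length must be passed separately.
SUBROUTINE sub77 (x, n)
   INTEGER :: n
   DOUBLE PRECISION :: x(*)
   x(1:n) = 0.0D0
END SUBROUTINE sub77

! F90 style: an assumed-shape dummy carries rank and extents with it,
! so the callee can use SIZE(x) and SHAPE(x) directly.
SUBROUTINE sub90 (x)
   DOUBLE PRECISION :: x(:,:)
   x = 0.0D0
END SUBROUTINE sub90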

cffischer avatar Mar 18 '19 18:03 cffischer

Does [INCLUDE 'mpif.h'] work with the latest OpenMPI 4?

According to the docs, mpif.h is still supported in OpenMPI 4.

FWIW, I can also confirm that the issue does not seem to be present on Linux (Ubuntu 18.04, GCC 7.3.0, OpenMPI 4.0.0 compiled by hand). Also, as far as I can tell, the MPI_Bcast interface has not changed between 2.x and 4.x.

@SachaSchiffmann Out of curiosity, how do you install OpenMPI on macOS? Via Homebrew?

mortenpi avatar Mar 19 '19 01:03 mortenpi

Dear all,

Maybe I wasn't clear about what I had already tried.

  1. I was originally using INCLUDE 'mpif.h'. The codes compiled fine, but execution gets stuck.
  2. Then I tried 'USE mpi' instead, as you suggested. The codes no longer compile, with the following error message:
Error: There is no specific subroutine for the generic 'mpi_bcast' at (1)

Charlotte, I have tried to eliminate the ierr parameter, but according to the documentation (MPI_Bcast documentation):

Fortran Syntax

USE MPI
! or the older form: INCLUDE ’mpif.h’
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type>    BUFFER(*)
    INTEGER    COUNT, DATATYPE, ROOT, COMM, IERROR

there is no problem keeping it. The error does not go away...

@mortenpi I indeed used Homebrew to install openmpi and gcc.

SachaSchiffmann avatar Mar 19 '19 14:03 SachaSchiffmann

Have you found good documentation of the Fortran version of MPI? The ones I like are for C, which is not quite the same as Fortran in certain details.

In lodcslmpi we have defined MPIX_INT1 to be of type "Byte". I think what the compiler is saying is that the generic MPI_BCAST has no version for this type of data.

So look at the comments in the code. Jacek has added comments to the effect that we should try using the type mpi_integer1. Somewhere in our different "kinds" we have defined the meaning of "bytes". We need to match this with the MPI kinds. Somewhere there is documentation on the various kinds recognized by MPI.
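To make the suggestion concrete, here is a sketch (buffer name and size are made up, and it assumes MPIX_INT1 is meant to describe 1-byte integers) of broadcasting a 1-byte integer array with the predefined MPI_INTEGER1 datatype:

! Sketch only: broadcast a 1-byte integer buffer using MPI_INTEGER1.
PROGRAM bcast_int1
   USE mpi
   IMPLICIT NONE
   INTEGER, PARAMETER :: BYTE = SELECTED_INT_KIND(2)   ! a 1-byte integer kind
   INTEGER(KIND=BYTE) :: buf(10)
   INTEGER :: myid, ierr
   CALL MPI_INIT (ierr)
   CALL MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
   IF (myid == 0) buf = 1_BYTE
   CALL MPI_BCAST (buf, SIZE(buf), MPI_INTEGER1, 0, MPI_COMM_WORLD, ierr)
   CALL MPI_FINALIZE (ierr)
END PROGRAM bcast_int1

Whether USE mpi accepts a 1-byte buffer at all depends on how the OpenMPI Fortran bindings were built; if no specific interface was generated for that kind, the "no specific subroutine for the generic" error can appear even with the correct datatype constant.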

Maybe you should discuss this with Jacek.

cffischer avatar Mar 19 '19 17:03 cffischer

I indeed used Homebrew to install openmpi and gcc.

I would see what happens if you compile OpenMPI by hand and link against that. That would rule out problems with the Homebrew install scripts.

mortenpi avatar Mar 19 '19 21:03 mortenpi

Hello,

I kept looking, but I still don't know why I am unable to run the programs correctly with the latest version of OpenMPI. I have tried to install it by hand, but it crashes during compilation. Therefore I reinstalled OpenMPI 3.1.3 and everything is working fine again. I will keep investigating OpenMPI 4.0.0, and if I cannot find a solution within a few days, I will send an email to Jacek.

@jongrumer if you have some time to test this as well, it would be great.

SachaSchiffmann avatar Mar 22 '19 13:03 SachaSchiffmann

@SachaSchiffmann Unfortunately I don't have time at the moment. Maybe later, but I'm off to Max Planck now to work on a kilonova code for two weeks, and then to the Lab Astro conference at Cambridge. Hope that things have calmed down around mid-April.

jongrumer avatar Mar 26 '19 09:03 jongrumer

@SachaSchiffmann What's the status on this issue? Any progress?

jongrumer avatar Apr 24 '19 22:04 jongrumer

@jongrumer No progress so far. I had to revert to an older version.

SachaSchiffmann avatar Apr 25 '19 07:04 SachaSchiffmann