
Defect: Performance issue on multiple nodes

Open gutmann opened this issue 6 years ago • 16 comments

Defect/Bug Report

Performance issue on multiple nodes. I thought I'd add a comment here on a performance issue I'm seeing at the moment using coarrays in derived types. I haven't chased down anything more specific, but using the coarray-icar test-ideal and a small (200 x 200 x 20) problem size, I'm seeing huge slowdowns across multiple nodes. This was not present in OpenCoarrays 1.9.1 (with GCC 6.3), and it is not present with Intel.

I initially thought this could be related to #556, but I now think it is completely separate, since this involves internode communication and thus requires MPI calls.

Coarray_icar uses non-blocking sends on allocatable coarrays embedded within derived types. It first processes halo grid cells, then sends them, then processes internal grid cells, and finally syncs with its neighbors before reading the halos that were sent to it.
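For readers unfamiliar with coarray_icar, here is a minimal sketch of that communication pattern, condensed to a single north/south neighbor pair. The names loosely follow coarray_icar's `exchangeable_t`, but this is an illustrative reduction, not the actual source:

```fortran
module halo_sketch
  implicit none
  integer, parameter :: halo_size = 1

  type :: exchangeable_t
    real, allocatable :: local(:,:,:)
    ! allocatable coarray component embedded in a derived type:
    ! each image exposes an inbound halo buffer to its neighbors
    real, allocatable :: halo_south_in(:,:,:)[:]
  end type

contains

  subroutine exchange(this, north_neighbor, south_neighbor)
    class(exchangeable_t), intent(inout) :: this
    integer, intent(in) :: north_neighbor, south_neighbor
    integer :: n
    n = ubound(this%local, 3)

    ! 1. halo cells have already been computed; push them into the
    !    north neighbor's inbound buffer (non-blocking in spirit)
    this%halo_south_in(:,:,1:halo_size)[north_neighbor] = &
        this%local(:,:,n-2*halo_size+1:n-halo_size)

    ! 2. ...compute internal grid cells here, overlapping the put...

    ! 3. synchronize, then read the halo the south neighbor sent us
    sync images ([north_neighbor, south_neighbor])
    this%local(:,:,1:halo_size) = this%halo_south_in(:,:,1:halo_size)
  end subroutine

end module
```

The key point for this issue is that the puts target strided sections of coarray components inside a derived type, which exercises a different (and apparently slower) code path in OpenCoarrays 2.x than in 1.9.1.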

  • OpenCoarrays Version:
  • Fortran Compiler: gfortran 8.1
  • C compiler used for building lib: gcc 8.1
  • Installation method: install.sh
  • Output of uname -a: Linux cheyenne1 3.12.62-60.64.8-default #1 SMP Tue Oct 18 12:21:38 UTC 2016 (42e0a66) x86_64 x86_64 x86_64 GNU/Linux
  • MPI library being used: MPICH 3.2.1
  • Machine architecture and number of physical cores: SGI Xeons+infiniband 36 cores / node cheyenne
  • Version of CMake: 3.9.1

Observed Behavior

Performance gets worse when multiple nodes are used

Expected Behavior

Performance gets better when multiple nodes are used

Steps to Reproduce

git clone https://github.com/gutmann/coarray_icar
cd coarray_icar/src/tests
make MODE=fast

# edit input-parameters.txt
cat > input-parameters.txt <<EOF
&grid nx=200,ny=200,nz=20 /
EOF

# get access to multiple nodes
for (( i=36; i<=144; i+=36 )); do
    cafrun -n $i ./test-ideal | grep "Model run time"
done

Example results for OpenCoarrays 2.1, OpenCoarrays 1.9.1, and Intel 18.0.1 (more details on each below):

| Images | OpenCoarrays 2.1 | OpenCoarrays 1.9.1 | Intel 18.0.1 |
|-------:|-----------------:|-------------------:|-------------:|
|     36 |             14.7 |               16.3 |         11.3 |
|     72 |              105 |                8.9 |          5.8 |
|    144 |              140 |                4.6 |          3.2 |
|    720 |              170 |                1.4 |         0.94 |

All times are in seconds. This is just the core runtime of the numerics, not any of the initialization time.

OpenCoarrays 2.1

gfortran/gcc 8.1 (built via opencoarrays install.sh)
mpich 3.2.1 (built via opencoarrays install.sh)
opencoarrays 2.1.0-31-gc0e3ffb (with fprintf statements commented out in mpi/mpi.c)

OpenCoarrays 1.9.1

gfortran/gcc 6.3
MPT 2.15f
opencoarrays 1.9.1

Intel 18.0.1

ifort 18.0.1
iMPI 2018.1.163

gutmann avatar Jul 03 '18 18:07 gutmann

Note, I have logged into the remote nodes when the code is running and I see the process correctly running on all remote nodes, so there isn't anything weird going on wherein it is trying to run all images on one node or anything.

gutmann avatar Jul 03 '18 18:07 gutmann

Hi Ethan, please make sure to use -O0 while compiling with Intel and GFortran.

On Tue, Jul 3, 2018 at 12:42 PM Ethan Gutmann [email protected] wrote:

Note, I have logged into the remote nodes when the code is running and I see the process correctly running on all remote nodes, so there isn't anything weird going on wherein it is trying to run all images on one node or anything.


--

Alessandro Fanfarillo

afanfa avatar Jul 03 '18 18:07 afanfa

please make sure to use -O0 while compiling with Intel and GFortran.

@afanfa just out of curiosity, why?

zbeekman avatar Jul 03 '18 19:07 zbeekman

Thanks @afanfa, `make MODE=fast` compiles with `caf -fopenmp -lgomp -Ofast -cpp -DUSE_ASSERTIONS=.false.`. I am setting `OMP_NUM_THREADS=1` in the tests above to disable OpenMP threading, though I suppose it could still be doing something behind the scenes. Testing now without OpenMP enabled at all (and with -O0 instead of -Ofast).

gutmann avatar Jul 03 '18 19:07 gutmann

With -O0 and no OpenMP directives:

| Images | Runtime |
|-------:|--------:|
|     36 |      26 |
|     72 |     116 |

gutmann avatar Jul 03 '18 19:07 gutmann

Intel is much better than GFortran at optimizing code (I'm referring to computation, not communication). If the code has even a small computational component, Intel will almost always produce better results when the same optimization level is applied (unless it's -O0).

On Tue, Jul 3, 2018 at 1:00 PM zbeekman [email protected] wrote:

please make sure to use -O0 while compiling with Intel and GFortran.

@afanfa https://github.com/afanfa just out of curiosity, why?



afanfa avatar Jul 03 '18 19:07 afanfa

Yes, I don't care too much about the relative performance of Intel vs GFortran; I'm more interested in the scaling of the two (using 2x more cores with Intel takes half the time, but with gfortran+OpenCoarrays it takes more than 5x longer here).

gutmann avatar Jul 03 '18 19:07 gutmann

@afanfa don't worry I suspect the performance regression happened in PR #427 or other "recent" changes and not in code you wrote 😉

zbeekman avatar Jul 03 '18 19:07 zbeekman

@gutmann I fixed your first table and deleted the (now extraneous) comment with the corrected formatting; the issue was that you need a blank line before tables.

zbeekman avatar Jul 03 '18 19:07 zbeekman

Thanks Zaak. ;-)

On Tue, Jul 3, 2018 at 1:08 PM zbeekman [email protected] wrote:

@afanfa https://github.com/afanfa don't worry I suspect the performance regression happened in PR #427 https://github.com/sourceryinstitute/OpenCoarrays/issues/427 and not in code you wrote 😉



afanfa avatar Jul 03 '18 19:07 afanfa

Sorry, I cannot test the scalability with this many cores.

I'm pretty sure this is NOT it, but can you try the following patch?

--- exchangeable_implementation_.f90	2018-07-03 21:14:56.723202212 +0200
+++ exchangeable_implementation.f90	2018-07-03 21:00:28.527662543 +0200
@@ -138,6 +138,7 @@

  module subroutine put_north(this)
      class(exchangeable_t), intent(inout) :: this
+      real, allocatable :: buf(:, :, :)
      integer :: n, nx
      n = ubound(this%local,3)
      nx = size(this%local,1)
@@ -150,12 +151,13 @@
      end if

      !dir$ pgas defer_sync
-      this%halo_south_in(1:nx,:,1:halo_size)[north_neighbor] = this%local(:,:,n-halo_size*2+1:n-halo_size)
+      buf = this%local(:,:,n-halo_size*2+1:n-halo_size)
+      this%halo_south_in(1:nx,:,1:halo_size)[north_neighbor] = buf
  end subroutine

  module subroutine put_south(this)
      class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
      integer :: start, nx
      start = lbound(this%local,3)
      nx = size(this%local,1)
@@ -167,31 +169,35 @@
                     "put_south: conformable halo_north_in and local " )
      end if
      !dir$ pgas defer_sync
-      this%halo_north_in(1:nx,:,1:halo_size)[south_neighbor] = this%local(:,:,start+halo_size:start+halo_size*2-1)
+      buf = this%local(:,:,start+halo_size:start+halo_size*2-1)
+      this%halo_north_in(1:nx,:,1:halo_size)[south_neighbor] = buf
  end subroutine

  module subroutine retrieve_north_halo(this)
      class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
      integer :: n, nx
      n = ubound(this%local,3)
      nx = size(this%local,1)

-      this%local(:,:,n-halo_size+1:n) = this%halo_north_in(:nx,:,1:halo_size)
+      buf = this%halo_north_in(:nx,:,1:halo_size)
+      this%local(:,:,n-halo_size+1:n) = buf
  end subroutine

  module subroutine retrieve_south_halo(this)
      class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
      integer :: start, nx
      start = lbound(this%local,3)
      nx = size(this%local,1)

-      this%local(:,:,start:start+halo_size-1) = this%halo_south_in(:nx,:,1:halo_size)
+      buf = this%halo_south_in(:nx,:,1:halo_size)
+      this%local(:,:,start:start+halo_size-1) = buf
  end subroutine

  module subroutine put_east(this)
      class(exchangeable_t), intent(inout) :: this
+      real, allocatable :: buf(:, :, :)
      integer :: n, ny
      n = ubound(this%local,1)
      ny = size(this%local,3)
@@ -204,12 +210,13 @@
      end if

      !dir$ pgas defer_sync
-      this%halo_west_in(1:halo_size,:,1:ny)[east_neighbor] = this%local(n-halo_size*2+1:n-halo_size,:,:)
+      buf = this%local(n-halo_size*2+1:n-halo_size,:,:)
+      this%halo_west_in(1:halo_size,:,1:ny)[east_neighbor] = buf
  end subroutine

  module subroutine put_west(this)
      class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
      integer :: start, ny
      start = lbound(this%local,1)
      ny = size(this%local,3)
@@ -222,27 +229,30 @@
      end if

      !dir$ pgas defer_sync
-      this%halo_east_in(1:halo_size,:,1:ny)[west_neighbor] = this%local(start+halo_size:start+halo_size*2-1,:,:)
+      buf = this%local(start+halo_size:start+halo_size*2-1,:,:)
+      this%halo_east_in(1:halo_size,:,1:ny)[west_neighbor] = buf
  end subroutine

  module subroutine retrieve_east_halo(this)
      class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
      integer :: n, ny
      n = ubound(this%local,1)
      ny = size(this%local,3)

-      this%local(n-halo_size+1:n,:,:) = this%halo_east_in(1:halo_size,:,1:ny)
+      buf = this%halo_east_in(1:halo_size,:,1:ny)
+      this%local(n-halo_size+1:n,:,:) = buf
  end subroutine

  module subroutine retrieve_west_halo(this)
      class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
      integer :: start, ny
      start = lbound(this%local,1)
      ny = size(this%local,3)

-      this%local(start:start+halo_size-1,:,:) = this%halo_west_in(1:halo_size,:,1:ny)
+      buf = this%halo_west_in(1:halo_size,:,1:ny)
+      this%local(start:start+halo_size-1,:,:) = buf
  end subroutine

On a machine with 16 physical / 32 logical cores (so the 36-image timing is irrelevant here):

| Images | Duration |
|-------:|---------:|
|      9 |  37.150s |
|     18 |  30.003s |
|     27 |  20.916s |
|     36 |  25.165s |

t-bltg avatar Jul 03 '18 19:07 t-bltg

Yes, don't use 36 there; 8 and 16 would be of interest, but I think the problem only appears when MPI calls are invoked across multiple nodes.

With the patch above applied (and -Ofast):

| Images | Runtime (new) | Runtime (original) |
|-------:|--------------:|-------------------:|
|     36 |          15.2 |               14.7 |
|     72 |          99.4 |                105 |

I suspect the differences are not statistically significant.

gutmann avatar Jul 03 '18 19:07 gutmann

@zbeekman have you tried reproducing this on another machine? I can dig around with different MPI implementations on cheyenne if it turns out this is somehow specific to my installation on cheyenne, but I haven't had this kind of speed issue there with past installations.

gutmann avatar Jul 10 '18 17:07 gutmann

Sorry, no bandwidth to spare at the moment. Hope to look later this week.

On Tue, Jul 10, 2018 at 1:37 PM Ethan Gutmann [email protected] wrote:

@zbeekman https://github.com/zbeekman have you tried reproducing this on another machine? I can dig around with different MPI implementations on cheyenne if it turns out this is somehow specific to my installation on cheyenne, but I haven't had this kind of speed issue there with past installations.


zbeekman avatar Jul 10 '18 17:07 zbeekman

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Mar 29 '19 09:03 stale[bot]

Be gone stalebot!

zbeekman avatar Mar 29 '19 12:03 zbeekman