OpenCoarrays
Defect: Performance issue on multiple nodes
Defect/Bug Report
I thought I'd add a comment here on a performance issue I'm seeing at the moment using coarrays in derived types. I haven't chased down anything more specific, but using the coarray-icar test-ideal and a small (200 x 200 x 20) problem size, I'm seeing huge slowdowns across multiple nodes. This was not present in OpenCoarrays 1.9.1 (with gcc 6.3), and it is not present with Intel.
I initially thought this could be related to #556, but I now think it is completely separate, since this involves internode communication and thus requires MPI calls.
Coarray_icar uses non-blocking sends on allocatable coarrays embedded within derived types. It first processes halo grid cells, then sends them, then processes internal grid cells, then syncs with its neighbors before reading the halos that were sent to it.
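For anyone not familiar with the code, here is a minimal, self-contained sketch of that pattern (the names and dimensions are made up for illustration; this is not the actual coarray_icar source, which additionally marks the puts with `!dir$ pgas defer_sync` and syncs only with its neighbors rather than using `sync all`):

```fortran
program halo_sketch
  implicit none
  integer, parameter :: nx = 8, ny = 8, nz = 16, halo = 1
  type :: exchangeable_t
     real, allocatable :: local(:,:,:)            ! interior + halo cells
     real, allocatable :: halo_south_in(:,:,:)[:] ! inbox written by my south neighbor
  end type
  type(exchangeable_t) :: f
  integer :: me, north

  me    = this_image()
  north = min(me + 1, num_images())

  allocate(f%local(nx, ny, nz))
  f%local = real(me)
  allocate(f%halo_south_in(nx, ny, halo)[*])      ! allocating a coarray synchronizes
  f%halo_south_in = 0.0

  ! "put": copy my northern-most interior cells into the north neighbor's inbox
  if (me < num_images()) &
       f%halo_south_in(:, :, 1:halo)[north] = f%local(:, :, nz-2*halo+1:nz-halo)

  ! ... internal grid cells would be processed here, overlapping with communication ...

  sync all

  ! "retrieve": copy what my south neighbor deposited into my own southern halo cells
  if (me > 1) f%local(:, :, 1:halo) = f%halo_south_in(:, :, 1:halo)
  if (me > 1) print *, 'image', me, 'received halo from image', int(f%local(1, 1, 1))
end program halo_sketch
```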
- OpenCoarrays Version:
- Fortran Compiler: gfortran 8.1
- C compiler used for building lib: gcc 8.1
- Installation method: install.sh
- Output of `uname -a`: Linux cheyenne1 3.12.62-60.64.8-default #1 SMP Tue Oct 18 12:21:38 UTC 2016 (42e0a66) x86_64 x86_64 x86_64 GNU/Linux
- MPI library being used: MPICH 3.2.1
- Machine architecture and number of physical cores: SGI cluster (Cheyenne), Intel Xeon + InfiniBand, 36 cores/node
- Version of CMake: 3.9.1
Observed Behavior
Performance gets worse when multiple nodes are used
Expected Behavior
Performance gets better when multiple nodes are used
Steps to Reproduce
git clone https://github.com/gutmann/coarray_icar
cd coarray_icar/src/tests
make MODE=fast
# edit input_parameters
cat > input-parameters.txt <<EOF
&grid nx=200,ny=200,nz=20 /
EOF
# get access to multiple nodes
for (( i=36; i<=144; i+=36 )); do
cafrun -n $i ./test-ideal | grep "Model run time"
done
Example results for OpenCoarrays 2.1, 1.9.1, and Intel (more details on each below):
Images | OpenCoarrays 2.1 | OpenCoarrays 1.9.1 | Intel 18.0.1 |
---|---|---|---|
36 | 14.7 | 16.3 | 11.3 |
72 | 105 | 8.9 | 5.8 |
144 | 140 | 4.6 | 3.2 |
720 | 170 | 1.4 | 0.94 |
All times in seconds. This is just the core runtime of the numerics, not any of the initialization time.
OpenCoarrays 2.1
gfortran/gcc 8.1 (built via opencoarrays install.sh)
mpich 3.2.1 (built via opencoarrays install.sh)
opencoarrays 2.1.0-31-gc0e3ffb (with fprintf statements commented out in mpi/mpi.c)
OpenCoarrays 1.9.1
gfortran/gcc 6.3
MPT 2.15f
opencoarrays 1.9.1
Intel 18.0.1
ifort 18.0.1*
iMPI 2018.1.163
Note: I have logged into the remote nodes while the code is running and I see the process running correctly on all of them, so there isn't anything weird going on wherein it tries to run all images on one node.
Hi Ethan, please make sure to use -O0 while compiling with Intel and GFortran.
--
Alessandro Fanfarillo
> please make sure to use -O0 while compiling with Intel and GFortran.
@afanfa just out of curiosity, why?
Thanks @afanfa, `make MODE=fast` compiles with `caf -fopenmp -lgomp -Ofast -cpp -DUSE_ASSERTIONS=.false.`. I am setting `OMP_NUM_THREADS=1` in the tests above to disable OpenMP threading, though I suppose it could still be doing something behind the scenes. Testing without OpenMP enabled at all now (and -O0 instead of -Ofast).
With `-O0` and no OpenMP directives:
Images | Runtime (s) |
---|---|
36 | 26 |
72 | 116 |
Intel is much better than GFortran at optimizing code (I'm referring to computation, not communication). If the code has even a small amount of computation, Intel will almost always produce better results when the same optimization level is applied (unless it's -O0).
--
Alessandro Fanfarillo
Yes, I don't care too much about the relative performance of Intel vs. gfortran; I'm more interested in the scaling of the two (using 2x more cores with Intel takes half the time, but with gfortran + OpenCoarrays it takes more than 5x longer here).
@afanfa don't worry, I suspect the performance regression happened in PR #427 or other "recent" changes, and not in code you wrote 😉
@gutmann I fixed your first table and deleted the (now extraneous) comment with the corrected formatting; the issue was that you need a blank line before tables.
Thanks Zaak. ;-)
--
Alessandro Fanfarillo
Sorry, I cannot test the scalability with this many cores.
I'm pretty sure this is NOT it, but can you try the following patch?
--- exchangeable_implementation_.f90 2018-07-03 21:14:56.723202212 +0200
+++ exchangeable_implementation.f90 2018-07-03 21:00:28.527662543 +0200
@@ -138,6 +138,7 @@
module subroutine put_north(this)
class(exchangeable_t), intent(inout) :: this
+ real, allocatable :: buf(:, :, :)
integer :: n, nx
n = ubound(this%local,3)
nx = size(this%local,1)
@@ -150,12 +151,13 @@
end if
!dir$ pgas defer_sync
- this%halo_south_in(1:nx,:,1:halo_size)[north_neighbor] = this%local(:,:,n-halo_size*2+1:n-halo_size)
+ buf = this%local(:,:,n-halo_size*2+1:n-halo_size)
+ this%halo_south_in(1:nx,:,1:halo_size)[north_neighbor] = buf
end subroutine
module subroutine put_south(this)
class(exchangeable_t), intent(inout) :: this
-
+ real, allocatable :: buf(:, :, :)
integer :: start, nx
start = lbound(this%local,3)
nx = size(this%local,1)
@@ -167,31 +169,35 @@
"put_south: conformable halo_north_in and local " )
end if
!dir$ pgas defer_sync
- this%halo_north_in(1:nx,:,1:halo_size)[south_neighbor] = this%local(:,:,start+halo_size:start+halo_size*2-1)
+ buf = this%local(:,:,start+halo_size:start+halo_size*2-1)
+ this%halo_north_in(1:nx,:,1:halo_size)[south_neighbor] = buf
end subroutine
module subroutine retrieve_north_halo(this)
class(exchangeable_t), intent(inout) :: this
-
+ real, allocatable :: buf(:, :, :)
integer :: n, nx
n = ubound(this%local,3)
nx = size(this%local,1)
- this%local(:,:,n-halo_size+1:n) = this%halo_north_in(:nx,:,1:halo_size)
+ buf = this%halo_north_in(:nx,:,1:halo_size)
+ this%local(:,:,n-halo_size+1:n) = buf
end subroutine
module subroutine retrieve_south_halo(this)
class(exchangeable_t), intent(inout) :: this
-
+ real, allocatable :: buf(:, :, :)
integer :: start, nx
start = lbound(this%local,3)
nx = size(this%local,1)
- this%local(:,:,start:start+halo_size-1) = this%halo_south_in(:nx,:,1:halo_size)
+ buf = this%halo_south_in(:nx,:,1:halo_size)
+ this%local(:,:,start:start+halo_size-1) = buf
end subroutine
module subroutine put_east(this)
class(exchangeable_t), intent(inout) :: this
+ real, allocatable :: buf(:, :, :)
integer :: n, ny
n = ubound(this%local,1)
ny = size(this%local,3)
@@ -204,12 +210,13 @@
end if
!dir$ pgas defer_sync
- this%halo_west_in(1:halo_size,:,1:ny)[east_neighbor] = this%local(n-halo_size*2+1:n-halo_size,:,:)
+ buf = this%local(n-halo_size*2+1:n-halo_size,:,:)
+ this%halo_west_in(1:halo_size,:,1:ny)[east_neighbor] = buf
end subroutine
module subroutine put_west(this)
class(exchangeable_t), intent(inout) :: this
-
+ real, allocatable :: buf(:, :, :)
integer :: start, ny
start = lbound(this%local,1)
ny = size(this%local,3)
@@ -222,27 +229,30 @@
end if
!dir$ pgas defer_sync
- this%halo_east_in(1:halo_size,:,1:ny)[west_neighbor] = this%local(start+halo_size:start+halo_size*2-1,:,:)
+ buf = this%local(start+halo_size:start+halo_size*2-1,:,:)
+ this%halo_east_in(1:halo_size,:,1:ny)[west_neighbor] = buf
end subroutine
module subroutine retrieve_east_halo(this)
class(exchangeable_t), intent(inout) :: this
-
+ real, allocatable :: buf(:, :, :)
integer :: n, ny
n = ubound(this%local,1)
ny = size(this%local,3)
- this%local(n-halo_size+1:n,:,:) = this%halo_east_in(1:halo_size,:,1:ny)
+ buf = this%halo_east_in(1:halo_size,:,1:ny)
+ this%local(n-halo_size+1:n,:,:) = buf
end subroutine
module subroutine retrieve_west_halo(this)
class(exchangeable_t), intent(inout) :: this
-
+ real, allocatable :: buf(:, :, :)
integer :: start, ny
start = lbound(this%local,1)
ny = size(this%local,3)
- this%local(start:start+halo_size-1,:,:) = this%halo_west_in(1:halo_size,:,1:ny)
+ buf = this%halo_west_in(1:halo_size,:,1:ny)
+ this%local(start:start+halo_size-1,:,:) = buf
end subroutine
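(The only change is to stage the array sections used in the halo puts and retrievals in contiguous local temporaries, `buf`, so that each coarray assignment operates on a whole contiguous array; the point is just to see whether the slowdown is related to how strided sections of the allocatable derived-type coarray components are handled.)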
On a machine with 16 physical / 32 logical cores (so the 36-image timing is irrelevant here):
Images | Duration |
---|---|
9 | 37.150s |
18 | 30.003s |
27 | 20.916s |
36 | 25.165s |
Yes, don't use 36 there; 8 and 16 would be of interest, but I think the problem only appears when you invoke MPI calls across multiple nodes.
With the patch above applied (and -Ofast):
Images | Runtime, patched (s) | Runtime, original (s) |
---|---|---|
36 | 15.2 | 14.7 |
72 | 99.4 | 105 |
I suspect that the differences are not statistically significant.
@zbeekman have you tried reproducing this on another machine? I can dig around with different MPI implementations on Cheyenne if it turns out this is somehow specific to my installation on Cheyenne, but I haven't had this kind of speed issue there with past installations.
Sorry, no bandwidth to spare at the moment. Hope to look later this week.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Be gone stalebot!