Defect: src/tests/unit/simple/test1Caf.F90 AKA increment_my_neighbor fails

Open zbeekman opened this issue 7 years ago • 5 comments

Defect/Bug Report

src/tests/unit/simple/test1Caf.F90 AKA increment_my_neighbor fails (at least when oversubscribed @ 32 cores)

I have spent some time looking at this and can't convince myself that this is not a logic error in the test itself. So it could be a bug in the test or a bug in the library; in the latter case it is likely due to a race condition.

  • OpenCoarrays Version: 1.9.0-5-g232d234
  • Fortran Compiler: GFortran 7.1
  • C compiler used for building lib: GCC 7.1
  • Installation method: FC=gfortran-7 CC=gcc-7 cmake ..
  • Output of uname -a: Darwin IBBs-MBP.local 14.5.0 Darwin Kernel Version 14.5.0: Tue Apr 11 16:12:42 PDT 2017; root:xnu-2782.50.9.2.3~1/RELEASE_X86_64 x86_64
  • MPI library being used: MPICH 3.2
  • Machine architecture and number of physical cores: Intel_64 @ 4 cores
  • Version of CMake: 3.8.2

Observed Behavior

Test fails when oversubscribed at 32 images

Expected Behavior

Test passes

Steps to Reproduce

Uncomment the relevant line in CMakeLists.txt (currently L568).

zbeekman avatar Jun 20 '17 18:06 zbeekman

I have noticed several tests failing when oversubscribed, failing in the sense that they ran forever until I terminated them manually. This could be because the tests make no sense at higher numbers of processes. I will post what I am seeing later.

jerryd avatar Jun 21 '17 00:06 jerryd

Some tests require a specific number of images, such as a power of two.

zbeekman avatar Jun 21 '17 00:06 zbeekman

I have been running some tests on this case. With -np 4 I get a failure about once every 100 runs. The failure rate goes up with increased -np.
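
A failure rate like this can be measured with a small shell loop. The sketch below is a hypothetical helper (`count_failures` is not part of OpenCoarrays) that runs a command repeatedly and counts the runs whose output lacks a success marker. It assumes the test prints "Test passed." on success; adjust the pattern if the test reports its result differently. `TEST_BIN` in the usage comment is a placeholder for the path to the built test executable.

```shell
#!/bin/sh
# count_failures RUNS PATTERN CMD [ARGS...]
# Hypothetical helper: runs CMD RUNS times and prints "failures/RUNS".
# A run counts as a failure when its combined stdout/stderr does not
# contain the fixed string PATTERN.
count_failures() {
  runs=$1
  pattern=$2
  shift 2
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # grep -F matches PATTERN literally; the pipeline status is grep's,
    # so "!" is true when the success marker is missing.
    if ! "$@" 2>&1 | grep -F -- "$pattern" >/dev/null; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures/$runs"
}

# Example usage (assumes TEST_BIN points at the increment_my_neighbor binary):
#   count_failures 100 "Test passed." bin/cafrun -np 4 "$TEST_BIN"
```

Note that a run that hangs will stall the loop, so it may be worth wrapping the command in a timeout utility if one is available on the system.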

I can eliminate failures by inserting a 'sync all' here:

me = this_image()
np = num_images()

sync all

left = merge(np,me-1,me==1)
right = merge(1,me+1,me==np)

jerryd avatar Jun 23 '17 03:06 jerryd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Mar 29 '19 09:03 stale[bot]

Building with MPICH 3.3 and GCC/GFortran 8.3 (all from Homebrew on macOS):

export FC="$(which gfortran-8)"
export CC="$(which gcc-8)"
cmake -Wdev -DCMAKE_BUILD_TYPE:STRING=Debug -DCMAKE_Fortran_FLAGS:STRING="-g -fbacktrace -fcheck=bounds,pointer" -DCMAKE_C_FLAGS:STRING="-g -fstack-check" ..
make -j

and then testing with:

bin/cafrun -np 50 bin/OpenCoarrays-2.6.1-11-g84ea96a-tests/increment_my_neighbor

reliably causes failures on my work iMac:

Intel Core i5-4690 @ 3.5 GHz, 4 cores, 4 threads.

I have attached a full debug log of the runtime failure. increment_my_neighbor.50img.fail.txt

zbeekman avatar Mar 29 '19 16:03 zbeekman