flang icon indicating copy to clipboard operation
flang copied to clipboard

[regression] TeaLeaf_ref: intermittent crash when doing a call by a pointer to a procedure from an OpenMP parallel section

Open pawosm-arm opened this issue 3 years ago • 1 comments

This regression was caused by some of the patches introduced for helping with CP2K compilation crashes. These patches were merged more than a year ago:

  1. 82683130aeebdbe6d5c70044cd79ef22e54f4f22 "Fix issue with procedure pointers", Tue Nov 26 14:54:45 2019
  2. 9dd4ed032b4a2eaa3c6a31aacd4cae92695c22d5 "Fix a segfault with procedure dummy arguments", Thu Nov 7 08:15:06 2019
  3. 4498b2b61bd7363f032474e55800180d5a7c0134 "Fix issue identified in large application", Wed Nov 6 14:06:55 2019
  4. bfb6bb22e3713e642f08f12325fc031680f84f0b "Support default procedure pointer initialization", Fri Oct 25 17:08:08 2019

Although these patches helped to solve one of many CP2K compilation issues, they have introduced a regression that manifests itself only in OpenMP parallel regions with the number of running OpenMP threads greater than 1. With recent introduction of OpenMP-only compilation variant in CP2K, these patches are potentially harmful to CP2K too (therefore, by fixing an compile-time issue they introduced a potential for a runtime crash).

One of the affected Fortran applications is TeaLeaf_ref (see https://github.com/UK-MAC/TeaLeaf_refhttps://github.com/UK-MAC/TeaLeaf_ref). In its pack_kernel.f90 file it has pack_all subroutine which calls another procedure by pack_func pointer which is set to point at a particular subroutine few lines above. Trouble is, the pack_all subroutine itself is called from a OpenMP parallel region in the call_packing_functions subroutine defined in the pack.f90 file. After the introduction of the patches listed above, following new element (a global static variable) started to appear in the generated LLVM-IR:

%struct.STATICS4 = type <{ [88 x i8] , i8*  }>
@.STATICS4 = internal global %struct.STATICS4 <{ [88 x i8]  [i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,
i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i
8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0,i8 0],  i8*  null }>, align 16, !dbg !125

Further on in this IR, read and write accesses to this variable are performed in the pack_all subroutine, without any locking. During the execution, sooner or later it results in an attempt to either calling a wrong subroutine or to call to the NULL pointer which results in a segfault like this one:

$ OMP_NUM_THREADS=10 mpirun -np 3 ./tea_leaf

   Tea Version    1.403
       MPI Version
    OpenMP Version
   Task Count      3
 Thread Count:    10


   Tea Version    1.403
       MPI Version
    OpenMP Version
   Task Count      3
 Thread Count:    10

 Output file tea.out opened. All output will go there.
[vcn-man-comp:87373] *** Process received signal ***
[vcn-man-comp:87372] *** Process received signal ***
[vcn-man-comp:87373] Signal: Segmentation fault (11)
[vcn-man-comp:87373] Signal code: Address not mapped (1)
[vcn-man-comp:87373] Failing at address: (nil)
[vcn-man-comp:87372] Signal: Segmentation fault (11)
[vcn-man-comp:87372] Signal code: Address not mapped (1)
[vcn-man-comp:87372] Failing at address: (nil)
[vcn-man-comp:87373] [ 0] [0xffff8152066c]
[vcn-man-comp:87373] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node localhost exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

This crash could be easily reproduced on AArch64 and x86_64 machines.

SOME WORKAROUND: I've slightly modified the IR by adding the thread_local attribute to the .STATICS4 global variable. After compiling it and linking into the final executable binary I've observed that this indeed resulted in an successful execution of the tea_leaf with many threads.

POTENTIAL DRAWBACKS OF THIS WORKAROUND (NEEDS INVESTIGATION): It is not certain if such approach is correct in every case. What if the pack_all has its own OpenMP parallel region in which pack_func is accessed? Should a change to the pack_func pointer made in this function before entering the parallel region be propagated to all of the other threads? What if the pack_func is listed in PRIVATE or SHARED clauses? Will it be handled with necessary care?

Note that TeaLeaf_ref is an MPI application. We've been testing it with OpenMPI version 4.0.5. The affected code is executed when there are more than two MPI processes started, e.g. by invoking OMP_NUM_THREADS=2 mpirun -np 3 ./tea_leaf. We were testing TeaLeaf_ref from the feature/armcompiler branch (see https://github.com/UK-MAC/TeaLeaf_ref/tree/feature/armcompiler). Both AArch64 and x86_64 tests were built from this branch.

pawosm-arm avatar Apr 06 '21 20:04 pawosm-arm

I've found some time to look into this. The following line in the pack_kernel.f90 file causes problem:

   PROCEDURE(pack_or_unpack), POINTER :: pack_func => NULL()

As I mentioned in my previous comments, after introduction of the commits I've listed above, a static global variable is created for the pack_func variable. This is clearly not safe for multi-threaded OpenMP programs and it results in a race condition. One way to avoid creation of a static variable is to rewrite this code as such:

  PROCEDURE(pack_or_unpack), POINTER :: pack_func

  NULLIFY(pack_func)

Other way is to make one of the changes introduced by those listed commits ineffective for OpenMP compilation mode, namely:

--- a/tools/flang1/flang1exe/semant.c
+++ b/tools/flang1/flang1exe/semant.c
@@ -11357,10 +11357,15 @@ proc_dcl_init:
     sptr = refsym(sptr, OC_OTHER);
     SST_SYMP(RHS(3), sptr);
     SST_IDP(RHS(3), S_IDENT);
-    sem.dinit_data = TRUE;
-    (void)mkvarref(RHS(3), ITEM_END);
-    sem.dinit_data = FALSE;
-    inited = TRUE;
+    if (flg.smp) {
+      sem.dinit_data = FALSE;
+      inited = FALSE;
+    } else {
+      sem.dinit_data = TRUE;
+      (void)mkvarref(RHS(3), ITEM_END);
+      sem.dinit_data = FALSE;
+      inited = TRUE;
+    }

   proc_dcl_shared:
     sptr = SST_SYMG(RHS(1));

Note that this fix isn't perfect, but it seems to be good enough.

pawosm-arm avatar May 24 '22 20:05 pawosm-arm