
Loop fusing and OpenMP transformations generating incorrect code

Open thomasmelvin opened this issue 1 year ago • 3 comments

LFRic https://code.metoffice.gov.uk/trac/lfric_apps/ticket/254 replaces a group of bespoke PSyKAl-lite routines for some built-ins with PSyclone-generated code using loop-fusing transformations. However, during testing of that change it was found that the combination of the loop-fusing and OpenMP transforms produced code that looks incorrect for multiple threads and in fact causes the model to fail even with a single thread. The solution currently chosen is to remove the OpenMP transform from this algorithm, but in the long run we would like to run with OpenMP threading and more loop fusing. Given the algorithm-level code

   call invoke( name="test_psyclone_transforms" , &
                a_times_X(field_np1, dt, field_n), & 
                inc_X_plus_Y(field_np1, field_n) )

and the optimisation script

   from __future__ import absolute_import, print_function
   from psyclone.domain.lfric.transformations import LFRicLoopFuseTrans
   from psyclone.transformations import TransformationError
   from psyclone_tools import (redundant_computation_setval, colour_loops,
                               openmp_parallelise_loops,
                               view_transformed_schedule)
   def fuse_loops(psy):

       '''PSyclone transformation script for the Dynamo0.3 API to apply loop
       fusion generically to all top level loops.

       '''
       total_fused = 0
       lf_trans = LFRicLoopFuseTrans()

       # Loop over all of the Invokes in the PSy object
       for invoke in psy.invokes.invoke_list:

           local_fused = 0
           schedule = invoke.schedule

           # Loop over all nodes in reverse order
           idx = len(schedule.children) - 1
           while idx > 0:
               node = schedule.children[idx]
               prev_node = schedule.children[idx-1]
               try:
                   if node.iteration_space == "dof":
                       try:
                           lf_trans.apply(prev_node, node, {"same_space": True})
                           local_fused += 1
                       except TransformationError:
                            pass
               except AttributeError:
                   pass
               idx -= 1
           total_fused += local_fused
           if local_fused > 0:
               print("After fusing ...")
               print(schedule.view())

       print(f"Fused {total_fused} loops")

   def trans(psy):

       redundant_computation_setval(psy)
       fuse_loops(psy)
       colour_loops(psy)
       openmp_parallelise_loops(psy)
       view_transformed_schedule(psy)

       return psy
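
The psyclone_tools helpers used above are site-specific LFRic utilities and are not shown here. As a rough sketch only (the helper name, the choice of DynamoOMPParallelLoopTrans and the loop over the schedule are assumptions for illustration, not the actual LFRic implementation), openmp_parallelise_loops does something along these lines:

   from psyclone.psyir.nodes import Loop
   from psyclone.transformations import (DynamoOMPParallelLoopTrans,
                                         TransformationError)

   def openmp_parallelise_loops(psy):
       '''Illustrative sketch only: wrap every loop that allows it in an
       OpenMP "parallel do" region.'''
       omp_trans = DynamoOMPParallelLoopTrans()
       for invoke in psy.invokes.invoke_list:
           for loop in invoke.schedule.walk(Loop):
               try:
                   omp_trans.apply(loop)
               except TransformationError:
                   # Loop cannot be parallelised (e.g. it is already
                   # inside a parallel region) - leave it unchanged.
                   pass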

The generated PSy-layer code is

   SUBROUTINE invoke_test_psyclone_transforms(field_np1, dt, field_n)
      USE mesh_mod, ONLY: mesh_type
      REAL(KIND=r_tran), intent(in) :: dt
      TYPE(r_tran_field_type), intent(in) :: field_np1, field_n
      INTEGER df
      INTEGER(KIND=i_def) loop0_start, loop0_stop
      REAL(KIND=r_tran), pointer, dimension(:) :: field_n_data => null()
      REAL(KIND=r_tran), pointer, dimension(:) :: field_np1_data => null()
      TYPE(r_tran_field_proxy_type) field_np1_proxy, field_n_proxy
      INTEGER(KIND=i_def) max_halo_depth_mesh
      TYPE(mesh_type), pointer :: mesh => null()
      !
      ! Initialise field and/or operator proxies
      !
      field_np1_proxy = field_np1%get_proxy()
      field_np1_data => field_np1_proxy%data
      field_n_proxy = field_n%get_proxy()
      field_n_data => field_n_proxy%data
      !
      ! Create a mesh object
      !
      mesh => field_np1_proxy%vspace%get_mesh()
      max_halo_depth_mesh = mesh%get_halo_depth()
      !
      ! Set-up all of the loop bounds
      !
      loop0_start = 1
      loop0_stop = field_np1_proxy%vspace%get_last_dof_annexed()
      !
      ! Call kernels and communication routines
      !
      !$omp parallel default(shared), private(df,field_np1_data)
      !$omp do schedule(static)
      DO df=loop0_start,loop0_stop
        field_np1_data(df) = dt * field_n_data(df)
        field_np1_data(df) = field_np1_data(df) + field_n_data(df)
      END DO
      !$omp end do
      !$omp end parallel
      !
      ! Set halos dirty/clean for fields modified in the above loop(s)
      !
      CALL field_np1_proxy%set_dirty()
      !
      ! End of set dirty/clean section for above loop(s)
      !
      !
    END SUBROUTINE invoke_test_psyclone_transforms

and the issue is that field_np1_data should not be included in the private clause of the OMP parallel directive, since it points to the global field array and hence needs to be shared. If it is removed (either manually via a PSyKAl-lite approach, or by turning off the OpenMP transforms) then the code runs happily.

thomasmelvin avatar Aug 29 '24 09:08 thomasmelvin

Thanks for reporting this @thomasmelvin. I think we generate the clauses in the backend so there must be a bug there. @sergisiso do you have any comments as I know you're altering the code-generation for LFRic at the moment?

arporter avatar Aug 29 '24 14:08 arporter

The current logic for OpenMP sharing clauses should set all arrays as shared by default. I believe this is an issue with pointers, for which we recently added support in the intermediate representation; probably pointer arrays are not properly reported as arrays. I am surprised this was not caught by our integration tests; I will have a look.
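
Roughly speaking, the intended rule is something like this (a simplified sketch of the idea, not PSyclone's actual implementation):

    from psyclone.psyir.symbols import ArrayType

    def sharing_attribute(symbol, written_in_region):
        '''Toy rule: arrays default to shared; scalars that are written
        inside the parallel region are privatised.'''
        if isinstance(symbol.datatype, ArrayType):
            return "shared"
        if symbol in written_in_region:
            # A field-data pointer whose datatype is not recorded as an
            # ArrayType falls through to here and is wrongly privatised,
            # which is the behaviour seen in the generated code above.
            return "private"
        return "shared"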

sergisiso avatar Aug 30 '24 09:08 sergisiso

The reason this is wrong is that the type of the field data values is wrong in the PSyIR symbol, but gen_code is hardcoded to generate the right declaration.

This is already solved in my branch that switches to the new backend, since there the type is not hardcoded and must match in order to get the right declaration. With that branch I get:

    ! Call kernels and communication routines
    !$omp parallel default(shared), private(df)
    !$omp do schedule(static)
    do df = loop0_start, loop0_stop, 1
      ! Built-in: a_times_X (copy a scaled real-valued field)
      field_np1_data(df) = dt * field_n_data(df)

      ! Built-in: inc_X_plus_Y (increment a real-valued field)
      field_np1_data(df) = field_np1_data(df) + field_n_data(df)
    enddo
    !$omp end do
    !$omp end parallel

I hope to get this branch into review soon, but if not I will backport this fix.

sergisiso avatar Sep 17 '24 07:09 sergisiso

@sergisiso, @arporter is there a PR for this if it's required for PSyclone 3.0.0?

TeranIvy avatar Dec 04 '24 15:12 TeranIvy

Sergi has had a look and it is non-trivial so we will have to postpone the fix given the timescale for 3.0. If that's ok then I suggest we remove the 'release' label from this.

arporter avatar Dec 04 '24 16:12 arporter

> Sergi has had a look and it is non-trivial so we will have to postpone the fix given the timescale for 3.0. If that's ok then I suggest we remove the 'release' label from this.

That's fine, @arporter 😃 We can live with PSyKAl-lite code for some time as we will have our hands full with porting for the next couple of months.

TeranIvy avatar Dec 06 '24 10:12 TeranIvy

I've just tested this with master and now I get:

    ! Call kernels and communication routines
    !$omp parallel do default(shared) private(df) schedule(static)
    do df = loop0_start, loop0_stop, 1
      ! Built-in: a_times_X (copy a scaled real-valued field)
      field_np1_data(df) = dt * field_n_data(df)

      ! Built-in: inc_X_plus_Y (increment a real-valued field)
      field_np1_data(df) = field_np1_data(df) + field_n_data(df)
    enddo
    !$omp end parallel do

therefore I'll close this issue.

arporter avatar Jul 31 '25 14:07 arporter