Loop fusion and OpenMP transformations generating incorrect code
LFRic https://code.metoffice.gov.uk/trac/lfric_apps/ticket/254 replaces a group of bespoke PSyKAl-lite routines for some built-ins with PSyclone-generated code using loop-fusion transformations. However, during testing of that change it was found that the combination of the loop-fusion and OpenMP transforms produced code that looks incorrect for multiple threads and in fact causes the model to fail even with only a single thread. The solution currently chosen is to remove the OpenMP transform from this algorithm, but in the long run we would like to run with OpenMP threading and more loop fusion. Given the algorithm-level code
call invoke( name="test_psyclone_transforms", &
             a_times_X(field_np1, dt, field_n), &
             inc_X_plus_Y(field_np1, field_n) )
and the optimisation script
from __future__ import absolute_import, print_function
from psyclone.domain.lfric.transformations import LFRicLoopFuseTrans
from psyclone.transformations import TransformationError
from psyclone_tools import (redundant_computation_setval, colour_loops,
                            openmp_parallelise_loops,
                            view_transformed_schedule)


def fuse_loops(psy):
    '''PSyclone transformation script for the Dynamo0.3 API to apply loop
    fusion generically to all top level loops.
    '''
    total_fused = 0
    lf_trans = LFRicLoopFuseTrans()
    # Loop over all of the Invokes in the PSy object
    for invoke in psy.invokes.invoke_list:
        local_fused = 0
        schedule = invoke.schedule
        # Loop over all nodes in reverse order
        idx = len(schedule.children) - 1
        while idx > 0:
            node = schedule.children[idx]
            prev_node = schedule.children[idx-1]
            try:
                if node.iteration_space == "dof":
                    try:
                        lf_trans.apply(prev_node, node, {"same_space": True})
                        local_fused += 1
                    except TransformationError:
                        pass
            except AttributeError:
                pass
            idx -= 1
        total_fused += local_fused
        if local_fused > 0:
            print("After fusing ...")
            print(schedule.view())
    print(f"Fused {total_fused} loops")


def trans(psy):
    redundant_computation_setval(psy)
    fuse_loops(psy)
    colour_loops(psy)
    openmp_parallelise_loops(psy)
    view_transformed_schedule(psy)
    return psy
The generated PSy-layer code is
SUBROUTINE invoke_test_psyclone_transforms(field_np1, dt, field_n)
  USE mesh_mod, ONLY: mesh_type
  REAL(KIND=r_tran), intent(in) :: dt
  TYPE(r_tran_field_type), intent(in) :: field_np1, field_n
  INTEGER df
  INTEGER(KIND=i_def) loop0_start, loop0_stop
  REAL(KIND=r_tran), pointer, dimension(:) :: field_n_data => null()
  REAL(KIND=r_tran), pointer, dimension(:) :: field_np1_data => null()
  TYPE(r_tran_field_proxy_type) field_np1_proxy, field_n_proxy
  INTEGER(KIND=i_def) max_halo_depth_mesh
  TYPE(mesh_type), pointer :: mesh => null()
  !
  ! Initialise field and/or operator proxies
  !
  field_np1_proxy = field_np1%get_proxy()
  field_np1_data => field_np1_proxy%data
  field_n_proxy = field_n%get_proxy()
  field_n_data => field_n_proxy%data
  !
  ! Create a mesh object
  !
  mesh => field_np1_proxy%vspace%get_mesh()
  max_halo_depth_mesh = mesh%get_halo_depth()
  !
  ! Set-up all of the loop bounds
  !
  loop0_start = 1
  loop0_stop = field_np1_proxy%vspace%get_last_dof_annexed()
  !
  ! Call kernels and communication routines
  !
  !$omp parallel default(shared), private(df,field_np1_data)
  !$omp do schedule(static)
  DO df=loop0_start,loop0_stop
    field_np1_data(df) = dt * field_n_data(df)
    field_np1_data(df) = field_np1_data(df) + field_n_data(df)
  END DO
  !$omp end do
  !$omp end parallel
  !
  ! Set halos dirty/clean for fields modified in the above loop(s)
  !
  CALL field_np1_proxy%set_dirty()
  !
  ! End of set dirty/clean section for above loop(s)
  !
END SUBROUTINE invoke_test_psyclone_transforms
and the issue is that field_np1_data should not be included in the private clause of the omp parallel directive (since it is the global field array and hence needs to be shared). If it is removed (either manually via a PSyKAl-lite approach, or by turning off the OpenMP transforms) then the code runs happily.
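For reference, here is a minimal standalone sketch of why this is fatal even with one thread (this is not LFRic or PSyclone-generated code; the program and variable names are invented for illustration). With private(field_data) every thread, including a single one, gets its own unassociated copy of the pointer, so the loop body no longer refers to the field's storage at all. Leaving the pointer shared, as in the corrected directive in this sketch, behaves as expected.

program private_pointer_demo
  ! Minimal illustration (not LFRic code): the data pointer must stay
  ! shared. If field_data were listed in private(), every thread -
  ! even a single one - would get its own unassociated copy of the
  ! pointer and the assignments below would be invalid.
  implicit none
  real, target  :: field(4) = 0.0
  real, pointer :: field_data(:) => null()
  integer :: df

  field_data => field

  !$omp parallel do default(shared) private(df)
  do df = 1, 4
     field_data(df) = 2.0 * real(df)
  end do
  !$omp end parallel do

  print *, field   ! expect 2.0 4.0 6.0 8.0
end program private_pointer_demo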
Thanks for reporting this, @thomasmelvin. I think we generate the clauses in the backend, so there must be a bug there. @sergisiso, do you have any comments, as I know you're altering the code generation for LFRic at the moment?
The current logic for the OpenMP sharing clauses should set all arrays as shared by default. I believe this will be an issue with pointers, for which we recently added support in the intermediate representation: pointer arrays are probably not being properly reported as arrays. I am surprised this was not caught by our integration tests; I will have a look.
The reason this is wrong is that the type of the field data values is wrong in the PSyIR symbol, but gen_code is hardcoded to generate the right declaration regardless.
This is already solved in my branch that switches to the new backend, as there the type is not hardcoded and must match in order to get the right declaration. With that branch I get:
! Call kernels and communication routines
!$omp parallel default(shared), private(df)
!$omp do schedule(static)
do df = loop0_start, loop0_stop, 1
  ! Built-in: a_times_X (copy a scaled real-valued field)
  field_np1_data(df) = dt * field_n_data(df)
  ! Built-in: inc_X_plus_Y (increment a real-valued field)
  field_np1_data(df) = field_np1_data(df) + field_n_data(df)
enddo
!$omp end do
!$omp end parallel
I hope to get this branch into review soon, but if not I will backport this fix.
@sergisiso, @arporter, is there a PR for this, in case it's required for PSyclone 3.0.0?
Sergi has had a look and it is non-trivial so we will have to postpone the fix given the timescale for 3.0. If that's ok then I suggest we remove the 'release' label from this.
That's fine, @arporter 😃 We can live with PSyKAl-lite code for some time as we will have our hands full with porting for the next couple of months.
I've just tested this with master and now I get:
! Call kernels and communication routines
!$omp parallel do default(shared) private(df) schedule(static)
do df = loop0_start, loop0_stop, 1
  ! Built-in: a_times_X (copy a scaled real-valued field)
  field_np1_data(df) = dt * field_n_data(df)
  ! Built-in: inc_X_plus_Y (increment a real-valued field)
  field_np1_data(df) = field_np1_data(df) + field_n_data(df)
enddo
!$omp end parallel do
therefore I'll close this issue.