Support pushing a loop into a subroutine in the PSyIR
NVIDIA are interested in whether PSyclone could push Fortran loops into subroutines to improve the performance of particular codes. The problem arises where a code loops over a large amount of work (including subroutine calls) that involves many scalars or arrays local to the loop. This can mean they end up with too many live variables and run out of registers, causing poor performance. Their solution is to increase the dimensionality of these temporary variables, creating more, smaller loops (with higher-dimensional arrays), and in particular to push loops into subroutines. For example:
do i=1,n
  tmp1(j) = a(i,j)
  call work(tmp1, tmp2)
  b(i,j) = tmp2(j)
end do
could become
do i=1,n
  tmp1(i,j) = a(i,j)
end do
call work(tmp1,tmp2)
do i=1,n
  b(i,j) = tmp2(i,j)
end do
with the work subroutine being updated appropriately as well.
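To make the intended equivalence concrete, here is a minimal sketch in plain Python rather than Fortran. It is an illustration only: the body of `work` is a made-up stand-in, `work_pushed` is a hypothetical name for the updated subroutine, and the `j` dimension is dropped for brevity.

```python
def work(tmp1):
    """Hypothetical stand-in for the 'work' subroutine (one element at a time)."""
    return 2.0 * tmp1 + 1.0


def work_pushed(tmp1):
    """Hypothetical updated 'work': the loop over i now lives inside it."""
    return [2.0 * value + 1.0 for value in tmp1]


def original(a):
    """Single loop: the temporaries are scalars reused on every iteration."""
    n = len(a)
    b = [0.0] * n
    for i in range(n):
        tmp1 = a[i]
        tmp2 = work(tmp1)
        b[i] = tmp2
    return b


def transformed(a):
    """Loop split in three: the temporaries gain a dimension indexed by i."""
    n = len(a)
    tmp1 = [0.0] * n
    for i in range(n):
        tmp1[i] = a[i]
    tmp2 = work_pushed(tmp1)  # single call, loop pushed into the callee
    b = [0.0] * n
    for i in range(n):
        b[i] = tmp2[i]
    return b


a = [1.0, 2.5, -3.0]
print(original(a) == transformed(a))
```

The promoted temporaries cost extra memory (one element per iteration), which is the trade-off being made for lower register pressure in each of the smaller loops.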
The first step would be to look at all variables within a specified loop, find the ones that do not depend on the loop index and make them depend on the loop index by adding a dimension (when they are modified).
A simplification would be to only allow this to happen if the associated variables were only accessed within the loop. If accessed outside the loop then another transformation would be required.
I think we should start by only allowing changes if a variable is only accessed within the loop.
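A rough sketch of that first step, written against a toy data structure rather than the PSyIR: each access is a hypothetical `(name, indices, is_write)` tuple, and `promotion_candidates` is an illustrative name, not an existing PSyclone function. It selects variables that are modified in the loop, never indexed by the loop variable, and not accessed outside the loop.

```python
from collections import defaultdict


def promotion_candidates(loop_var, accesses_in_loop, names_used_outside):
    """Return the sorted names of variables that would need an extra dimension.

    accesses_in_loop: iterable of (name, indices, is_write) tuples, where
    indices is a tuple of index-variable names (empty for a scalar).
    names_used_outside: names that are also accessed outside the loop.
    """
    indices_by_name = defaultdict(list)
    written = set()
    for name, indices, is_write in accesses_in_loop:
        indices_by_name[name].append(indices)
        if is_write:
            written.add(name)

    candidates = []
    for name, all_indices in indices_by_name.items():
        if name == loop_var:
            continue  # ignore the loop variable itself
        uses_loop_var = any(loop_var in indices for indices in all_indices)
        if name in written and not uses_loop_var \
                and name not in names_used_outside:
            candidates.append(name)
    return sorted(candidates)


# Accesses of the first example loop; it is assumed that 'work' writes tmp2.
accesses = [
    ("tmp1", ("j",), True),
    ("a", ("i", "j"), False),
    ("tmp1", ("j",), False),
    ("tmp2", ("j",), True),
    ("tmp2", ("j",), False),
    ("b", ("i", "j"), True),
]
result = promotion_candidates("i", accesses, names_used_outside={"a", "b"})
print(result)
```

A variable that uses the loop index in some accesses but not in others is simply skipped here; a real implementation would have to decide how to handle that case.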
Created branch 1322_lower_loop
Just FYI: I already have code that verifies, for all variables, whether a loop variable is used consistently (i.e. in the same dimension). It would be a trivial change to find all variables that do not use the loop variable as an index. ATM it's on a branch. This function tests that all accesses to one variable use the loop indices consistently:
https://github.com/stfc/PSyclone/blob/8f635b7c9c8f7141bf41cd86933ede713155bafb/src/psyclone/psyir/tools/dependency_tools.py#L113
The corresponding outer loop would be something along the lines of:
var_accesses = VariablesAccessInfo([node1, node2, ... ])
# Some PSyIR node giving access to the symbol table
symbol_table = loop.scope.symbol_table
for signature in var_accesses.all_signatures:
    if signature == loop_var_signature:
        continue  # ignore the loop variable
    var_info = var_accesses[signature]
    # Look up the symbol by the base variable name of the signature
    symbol = symbol_table.lookup(signature.var_name)
    # TODO #1270 - the is_array_access function might be moved
    is_array = symbol.is_array_access(access_info=var_info)
    if is_array:
        # use the function above to check whether the loop variable is used
        pass
    else:
        # scalar ... turn into array?
        pass
Note that this is still kind of WIP
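For illustration, the consistency check described above can be sketched on a toy representation in which each access is just a tuple of index names. This is not the code in `dependency_tools.py`; the function names and the data layout are assumptions made for the example.

```python
def loop_var_positions(loop_var, index_lists):
    """Collect, over all accesses of one array, the dimension positions
    in which loop_var appears. Each entry of the returned set is a tuple of
    positions; an empty tuple means the access does not use loop_var."""
    positions = set()
    for indices in index_lists:
        found = tuple(pos for pos, idx in enumerate(indices)
                      if idx == loop_var)
        positions.add(found)
    return positions


def is_used_consistently(loop_var, index_lists):
    """True if every access uses loop_var in the same dimension(s),
    including the case where no access uses it at all."""
    return len(loop_var_positions(loop_var, index_lists)) <= 1


# Both accesses index the second dimension with 'ji'.
print(is_used_consistently("ji", [(":", "ji"), ("jk", "ji")]))
# 'ji' appears in a different dimension in each access.
print(is_used_consistently("ji", [("ji", "jk"), ("jk", "ji")]))
```

An array whose only position entry is the empty tuple never uses the loop variable, which is exactly the case that would need an extra dimension.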
I've hit this problem while attempting to port the NEMO (v5.xxx) sea-ice routines to GPU. The loops have a lot going on inside them and often there are arrays that need to be 'privatised'. The obvious solution to this is to increase their dimensionality so that each loop iteration has its own member. e.g. in icethd_dh I've got:
do ji = 1, npti, 1
  zq_top = MAX(0._wp, qml_ice_1d(ji) * rDt_ice)
  zf_tt = qcn_ice_bot_1d(ji) + qsb_ice_bot_1d(ji) + fhld_1d(ji) + qtr_ice_bot_1d(ji) * frq_m_1d(ji)
  zq_bot = MAX(0._wp, zf_tt * rDt_ice)
  if (nn_icesal == 4) then
    zs_i(:,ji) = sz_i_1d(ji,:)
  else
    zs_i(:,ji) = s_i_1d(ji)
  end if
  zs_i_old(0:nlay_i + 1,ji) = 0._wp
  ze_i_old(0:nlay_i + 1,ji) = 0._wp
  zh_i_old(0:nlay_i + 1,ji) = 0._wp
  zh_i(0:nlay_i + 1,ji) = 0._wp
  do jk = 1, nlay_i, 1
    zs_i_old(jk,ji) = h_i_1d(ji) * r1_nlay_i * zs_i(jk,ji)
    ze_i_old(jk,ji) = h_i_1d(ji) * r1_nlay_i * e_i_1d(ji,jk)
    zh_i_old(jk,ji) = h_i_1d(ji) * r1_nlay_i
    zh_i(jk,ji) = h_i_1d(ji) * r1_nlay_i
  enddo
where I had to increase the dimensionality of the z?_i_old and zh_i arrays in order to make the outer loop over ji parallel.