Support pushing a loop into a subroutine in the PSyIR

Open rupertford opened this issue 4 years ago • 4 comments

NVIDIA are interested in whether PSyclone could push Fortran loops into subroutines to improve the performance of particular codes. Their problem arises where a code loops over a large amount of work (including subroutine calls) that uses many scalars or arrays local to the loop. This can lead to so many variables that the code runs out of registers, causing poor performance. Their solution is to increase the dimensionality of these temporary loop variables, creating more, smaller loops (with higher-dimensional arrays), and in particular to push loops into subroutines. For example:

do i=1,n
  tmp1(j) = a(i,j)
  call work(tmp1, tmp2)
  b(i,j) = tmp2(j)
end do

could become

do i=1,n
  tmp1(i,j) = a(i,j)
end do
call work(tmp1,tmp2)
do i=1,n
  b(i,j) = tmp2(i,j)
end do

with the work subroutine being updated appropriately as well.
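
To make the intended equivalence concrete, here is a minimal Python sketch of the same idea. It is only an illustration, not PSyclone code: the kernel body (`2.0 * x + 1.0`) and all function names are made up, and the free `j` index from the Fortran above is dropped for simplicity.

```python
def work(x):
    """Per-element kernel (hypothetical stand-in for the Fortran `work`)."""
    return 2.0 * x + 1.0


def fused(a):
    """Original form: the temporaries are scalars local to the loop over i."""
    b = [0.0] * len(a)
    for i in range(len(a)):
        tmp1 = a[i]
        tmp2 = work(tmp1)
        b[i] = tmp2
    return b


def work_with_loop(tmp1, tmp2):
    """Updated kernel: the loop over i now lives inside the subroutine."""
    for i in range(len(tmp1)):
        tmp2[i] = 2.0 * tmp1[i] + 1.0


def split(a):
    """Transformed form: temporaries gain a dimension, the loop is split."""
    n = len(a)
    tmp1 = [0.0] * n
    tmp2 = [0.0] * n
    for i in range(n):
        tmp1[i] = a[i]
    work_with_loop(tmp1, tmp2)
    b = [0.0] * n
    for i in range(n):
        b[i] = tmp2[i]
    return b
```

Both forms should give identical results for any input; the difference is only in how many temporaries are live inside each loop body.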

rupertford avatar Jul 01 '21 23:07 rupertford

The first step would be to look at all variables within a specified loop, find the ones that do not depend on the loop index and, where they are modified, make them depend on the loop index by adding a dimension.

A simplification would be to allow this only if the associated variables are accessed only within the loop. If they are accessed outside the loop then another transformation would be required.

I think we should start by only allowing changes if a variable is accessed only within the loop.
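
A rough sketch of that selection rule, in plain Python rather than the PSyIR. The `Access` record and `promotion_candidates` function are hypothetical and only model the idea; a real implementation would work from PSyclone's own access information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One access to a variable (hypothetical record, not a PSyclone class)."""
    name: str          # variable name
    indices: tuple     # index variables used, () for a scalar
    inside_loop: bool  # True if the access is within the target loop
    is_write: bool     # True if the variable is modified by this access


def promotion_candidates(accesses, loop_var):
    """Return the variables that could be given an extra dimension.

    A variable qualifies if it is modified inside the loop, none of its
    accesses use the loop variable as an index, and it is never accessed
    outside the loop.
    """
    by_name = {}
    for acc in accesses:
        by_name.setdefault(acc.name, []).append(acc)

    candidates = []
    for name, accs in by_name.items():
        if name == loop_var:
            continue  # ignore the loop variable itself
        if any(not a.inside_loop for a in accs):
            continue  # accessed outside the loop: needs another transformation
        if any(loop_var in a.indices for a in accs):
            continue  # already depends on the loop index
        if not any(a.is_write for a in accs):
            continue  # read-only inside the loop: nothing to privatise
        candidates.append(name)
    return sorted(candidates)
```

Applied to the accesses of the first example in this issue (loop variable `i`), this would pick out `tmp1` and `tmp2` but leave `a` and `b` alone.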

rupertford avatar Jul 01 '21 23:07 rupertford

Created branch 1322_lower_loop

rupertford avatar Jul 02 '21 00:07 rupertford

Just FYI: I already have code that verifies, for all variables, whether a loop variable is used consistently (i.e. in the same dimension). It would be a trivial change to find all variables that do not use the loop variable as an index. ATM it's on a branch. This function tests that all accesses to one variable use the loop indices consistently:

https://github.com/stfc/PSyclone/blob/8f635b7c9c8f7141bf41cd86933ede713155bafb/src/psyclone/psyir/tools/dependency_tools.py#L113

A corresponding outer loop would be something along the lines of:

var_accesses = VariablesAccessInfo([node1, node2, ... ])
for signature in var_accesses.all_signatures:
    if signature == loop_var_signature:
        continue   # ignore loop variable
    var_info = var_accesses[signature]
    # Any PSyIR node that gives access to the symbol table will do here
    symbol_table = loop.scope.symbol_table
    # Assumption: look the symbol up by the signature's base variable name
    symbol = symbol_table.lookup(signature.var_name)
    # TODO #1270 - the is_array_access function might be moved
    is_array = symbol.is_array_access(access_info=var_info)
    if is_array:
        # use the function linked above to check if the loop variable is used
        pass
    else:
        # scalar ... turn into array?
        pass

Note that this is still kind of WIP

hiker avatar Jul 03 '21 03:07 hiker

I've hit this problem while attempting to port the NEMO (v5.xxx) sea-ice routines to GPU. The loops have a lot going on inside them and often there are arrays that need to be 'privatised'. The obvious solution to this is to increase their dimensionality so that each loop iteration has its own member. e.g. in icethd_dh I've got:

    do ji = 1, npti, 1
      zq_top = MAX(0._wp, qml_ice_1d(ji) * rDt_ice)
      zf_tt = qcn_ice_bot_1d(ji) + qsb_ice_bot_1d(ji) + fhld_1d(ji) + qtr_ice_bot_1d(ji) * frq_m_1d(ji)
      zq_bot = MAX(0._wp, zf_tt * rDt_ice)
      if (nn_icesal == 4) then
        zs_i(:,ji) = sz_i_1d(ji,:)
      else
        zs_i(:,ji) = s_i_1d(ji)
      end if
      zs_i_old(0:nlay_i + 1,ji) = 0._wp
      ze_i_old(0:nlay_i + 1,ji) = 0._wp
      zh_i_old(0:nlay_i + 1,ji) = 0._wp
      zh_i(0:nlay_i + 1,ji) = 0._wp
      do jk = 1, nlay_i, 1
        zs_i_old(jk,ji) = h_i_1d(ji) * r1_nlay_i * zs_i(jk,ji)
        ze_i_old(jk,ji) = h_i_1d(ji) * r1_nlay_i * e_i_1d(ji,jk)
        zh_i_old(jk,ji) = h_i_1d(ji) * r1_nlay_i
        zh_i(jk,ji) = h_i_1d(ji) * r1_nlay_i
      enddo

where I had to increase the dimensionality of the z?_i_old and zh_i arrays in order to make the outer loop over ji parallel.
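
A toy Python model of why the extra dimension is needed. This is only a sketch: it mimics parallel execution by running "phase 1" of every iteration before "phase 2" of any, and the function names and arithmetic are made up rather than taken from NEMO.

```python
def lockstep_shared(h, nlay):
    """Every iteration shares one work array zh(jk): iterations interfere."""
    n = len(h)
    zh = [0.0] * nlay                  # single array shared by all ji
    out = [0.0] * n
    for ji in range(n):                # phase 1: each iteration fills zh
        for jk in range(nlay):
            zh[jk] = h[ji] / nlay
    for ji in range(n):                # phase 2: each iteration reads zh
        out[ji] = sum(zh)
    return out


def lockstep_private(h, nlay):
    """Each iteration owns its own column, mirroring zh_i(jk,ji)."""
    n = len(h)
    zh = [[0.0] * n for _ in range(nlay)]
    out = [0.0] * n
    for ji in range(n):                # phase 1: fill column ji only
        for jk in range(nlay):
            zh[jk][ji] = h[ji] / nlay
    for ji in range(n):                # phase 2: read column ji only
        out[ji] = sum(zh[jk][ji] for jk in range(nlay))
    return out
```

With the shared array, every iteration ends up reading whatever the last iteration wrote; with the promoted array, each iteration reads back its own values regardless of execution order.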

arporter avatar Mar 07 '24 11:03 arporter