PSyclone (Towards #1338) task transformation and directive

A task directive needs to generate dependencies correctly at codegen time.

There are a few principles I've used to determine what these should be:

The idea of a proxy loop. When a chunked loop is used around a task directive, the loop variable for the inner loop is considered to be a proxy for the outer loop, and dependencies are created accordingly. To do this, the inner loop must have a strcutre of do i = outer_var, outer_var+outer_step, step (I think? Something like this), and then any dependencies are computed based upon the outer loop variable instead.
Array dependency access to non-proxy loop variables fully inside the task are replaced with : as they are assumed to access any/all array indices.
A binary operation access to a proxy loop variable (e.g. i+1) will instead generate multiple dependencies of i and i+outer_step to cover this dependency to give a chance to generate valid dependencies.

The rest is mostly just dependencies as expected, with some restrictions of the types of variable (i.e. no shared array access, no array access inside an array access) to avoid ridiculously complex logic to handle these cases for general purpose implementations.

I will copy some comments over as well from the old PR.

Aug 01 '22 13:08 LonelyCat124

This is dependent on #1551 and will appear as all the code from that PR is here for now.

Aug 01 '22 13:08 LonelyCat124

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (48f9f26) 99.84% compared to head (f148551) 99.84%.

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #1822    +/-   ##
========================================
  Coverage   99.84%   99.84%            
========================================
  Files         344      347     +3     
  Lines       46013    46916   +903     
========================================
+ Hits        45942    46845   +903     
  Misses         71       71

Files	Coverage Δ
src/psyclone/errors.py	`100.00% <100.00%> (ø)`
src/psyclone/psyir/nodes/__init__.py	`100.00% <100.00%> (ø)`
src/psyclone/psyir/nodes/array_mixin.py	`100.00% <100.00%> (ø)`
...psyclone/psyir/nodes/dynamic_omp_task_directive.py	`100.00% <100.00%> (ø)`
src/psyclone/psyir/nodes/loop.py	`100.00% <ø> (ø)`
src/psyclone/psyir/nodes/omp_directives.py	`100.00% <100.00%> (ø)`
src/psyclone/psyir/nodes/omp_task_directive.py	`100.00% <100.00%> (ø)`
src/psyclone/psyir/transformations/__init__.py	`100.00% <100.00%> (ø)`
...c/psyclone/psyir/transformations/omp_task_trans.py	`100.00% <100.00%> (ø)`

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Aug 01 '22 13:08 codecov[bot]

Ok, looking back at this again today, and found a big oversight in my handle_binary_op strategy, which I'm unclear on the way to fix.

My assumption had been that if I had some code inside my task, e.g.:

do j=1, 320, 32
  !$omp task ....
  do i=j, j+32
    a(i) = b(i+1) - 3
  ...

was the clause(s) for the access to b could be handled by a single clause, however this is not the case. All clauses need to access a binary opreation of j and a multiple of the outer loop's step size. So in this case, to cover all values of b accessed by a single iteration, the correct clause needs to be depend(in: b(j),b(j + 32)). Right now I have no way to handle this, and it gets even more complex if there are more nested loops, e.g.

do jj=1, 320, 32
  do ii = 1, 320, 32
    !$omp task ...
    do i=ii, ii+32
      do j=jj, jj+32
       a(i, j) = b(i+1, j+1)
...

the required clauses is probably depend(in: b(i,j), b(i+32, j) b(i, j+32), b(i+32, j+32)).

The current handle_binary_op adds the new index (i.e. a BinaryOperation or Reference) to the index_list, which doesn't work when multiple are required. Instead I think the index_list needs to be a list of size N (where N is the number of indices) where each element is either:

A BinaryOperation
A Reference (or subclass)
A Literal
A list

The final set of index_lists can then be constructed from these. So if we had:

[ binop1, [binop2, binop3]]

the clauses would be: (binop1, binop2) and (binop1, binop3), or for a worse case:

[ [binop1, binop2], [binop3, binop4] ]

the clauses would be (binop1, binop3), (binop2, binop3), (binop1, binop4), (binop2, binop4).

The difficulty (from a design perspective) I have right now is I don't have a good strategy to construct all the permutations efficiently but I will continue to think about it.

Edit: Itertools.product can do this for me.

Aug 01 '22 14:08 LonelyCat124

I fixed the previous issue, and have run into another oversight I need to fix. A normal use case might be something like:

do i = 1, 320
  do j = 1, 32
    a(i,j) = b(i+1, j) + k
  enddo
enddo

or similar. A simple strategy in this case, would be to chunk the outer loop and create tasks along it:

do i = 1, 320, 32
  !$omp task firstprivate(ii), private(i, j), ....
  do ii = i, i + 32
    do j = 1, 32
      a(ii,j) = b(ii+1, j) + k
    enddo
  enddo
enddo

In this instance, ii is essentially a placeholder for the relevant value of i, so for the purposes of generating the dependencies it needs to be treated as such. I forgot about this when creating a lot of the code generation (oops) so its not taken into account I don't think. I will have to go through the code and find out a good way to handle this.

Aug 01 '22 14:08 LonelyCat124

I'm probably going to put design notes for the new ideas I have in this PR now.

Operation order:

Find all parent loops inside the same parallel region. Its not entirely clear how to do this efficiently, but probably it just has to involve recursive calls to ancestor using either Loop or OMPParallelRegion for the type argument. For each Loop, pull out the symbol and store it for later, and set it to be firstprivate.
Find all child loops inside the task region. For each of those loops, check if they are the inner section of a chunked loop structure, i.e. are they initialised as do var = parent_loop_var, .... If so, then we can store that Loop's variable as a "proxy" for the relevant parent_loop_var (i.e. when encountering this variable, it should be treated as the parent loop's variable for the purposes of dependency analysis). The loop variable itself is declared as private. Notably this is not totally straightforward if there are multiple loops using the same loop variable, this "proxy-ing" should be done on a per-loop basis, and only apply to the children of that loop.

The requirement to handle 2 correctly makes the next step more complex than the initial implementation.

We want to find all children of the node's immediate Loop child (to which the transformation was applied) which are either: Loop, Assignment, IfBlock.
For Loop nodes, we need to work out if they are a proxy loop, and then recurse through their children to find all Assignment/IfBlock/Loop nodes.
For Assignment nodes, we perform basically the same operation as now, computing all writes to shared variables into out dependencies, and all reads to shared variables as in dependencies.
For IfBlock nodes we perform the same operations as now, computing the relevant in dependencies and private/firstprivate if appropriate.

I'm hopeful that these 6 steps are sufficient to handle the full functionality needed for tasking.

Aug 01 '22 14:08 LonelyCat124

handle_binary_op at the moment does the following things, and is called only when a binary operation is found inside an array index:

If the binary operation is not an ADD or SUB, throws an exception
if the binary operation is not a LITERAL OP REFERENCE (or the opposite order) then throw an exception. This is possibly too strict (as one could do LITERAL OP LITERAL safely), but since those cases are unusual code I'm going to ignore them for now.
Find which of the children are the Reference, and store some info based on that for reconstructing the operation required for the clauses later.
There are then 2 options for how to handle this, and it depends on whether the Reference is to one of the proxy parent_loop_vars or not.

If the Reference is to a parent_loop_var proxy, we know that the Reference must be private (and thus in private_list). In this case, we need to compute the set of clauses required, based upon the Literal's value and the loop step size for the proxy variable. Note: At the moment the variable has no way to reference the loop that it is the variable to, so we should store that Loop node in the proxy system too.

If the Reference is to a loop variable that is private and not a proxy, then we need to create a depend clause to that entire array (i.e. :) for that index. If the Reference is to any other private variable, we assume its just a constant so just use a copy.

If the Reference variable is in firstprivate_list, then it currently checks if the variable is from a parent loop. If it is, then we use the same code as for when its a proxy variable, which I think is wrong. We can instead just use acopy the reference (since the parent loop variable is constant). I think this was done to handle the proxy variables from before. If it is a child loop's variable, it should throw an exception (child loop variables are not allowed to be firstprivate). In fact I think this can never happen, so its mostly a sanity check for now, as its untestable code I think. Otherwise, just use a copy of the reference.

Finally, if the variable has not yet been set private or firstprivate, we set it to firstprivate and then perform the same operations as for when its firstprivate.

If the variable is shared, then we throw an exception for now. If shared variables are used, they're in principle supportable, but I don't have a good use-case (since shared variables for now are ArrayReferences, which are banned from being used to index inside task directives).

I think I will split this function into functions to handle private and firstprivate accesses seperately.

Edit 1: A non-split version of this for read-only binary ops is now implemented.

Aug 01 '22 14:08 LonelyCat124

handle_read_only_ref at the moment does the following things:

For (ArrayReference, ArrayOfStructuresReference): At the moment, the initial implementation checks whether this ArrayReference was private in the parent parallel. This can never happen so can be removed. We thus know the Array is shared, so we loop over the indices. If an Array, AoS or StructureReference appears in the indexing we throw an exception. If the index is a Reference, handle computing the dependence & index structure accordingly, depending on if its a child loop, proxy loop or general variable. If its a binary operation then the handle_binary_op function is used. Anything else is disallowed (though I think a Literal should also be fine...). This results in a list of (lists) of indexes, and itertools.product is used to create all sets of clauses required to construct
For StructureReference: Create a reference to the symbol of the reference (for use later). Perform the same checks done for a general Reference on this new reference.
For Reference: If the variable was declared private in the parent parallel region, and is not yet in the private_list or firstprivate_list for this clause, add it to the firstprivate_list (as it is read before being written to, so we need some data). If it was not declared private in the parent parallel region, it is a shared variable, so we need to add an in dependency to this Reference.

Edit: This is implemented with its subfunctions.

Aug 01 '22 14:08 LonelyCat124

There are some uncovered loc in this PR right now, but they are implemented in #1551 so will be fixed when that is merged and this is rebased to master.

Aug 01 '22 14:08 LonelyCat124

Ok - I tested this with NemoLite2D and found a variety of issues/bugs to fix.

StructureReference need to only add the outermost member to a clause, as it should be ax and not ax%thing as the latter is not valid.
This directive doesn't work when the kernel is hidden inside a call. I didn't think this would be an issue for Nemo (and would come up later), but is an issue in NemoLite2D. I need to do something with CodedKern therefore to have this correct.

Aug 02 '22 09:08 LonelyCat124

1 should be handled for most structure references, however its notably not handled correctly when those structures are inside the declared loop's start/stop/step conditions, so I will fix that first.

Aug 02 '22 09:08 LonelyCat124

Ok, a lot of puzzling today and I made a bit of progress and hit some potential hurdles.

One thing I'm realising quickly is that since kernels use calls to things like ssha%data I might hit issues with a few things in my implementation for Kern: My idea to create a mapping of name to name won't work how I envisioned since create a copy of a kernel and then renaming the ssha symbol to ssha%data is not sensible.

It does not break right now, but it also does not generate any dependencies so I need to see whats going on tomorrow.

Aug 02 '22 15:08 LonelyCat124

I worked out why nothing was happening. Now the code breaks because when looking in a kernel the checks I did for Reference's being in the parent private region don't work any more.

This is due to the equality check on Reference checking whether the .symbols of both objects are ==. I think this is generally the correct way to do things (if you have two References on different symbols that have the same name they are not the same, they could be in different modules etc. etc.). In this case though I should just check the name and types of the symbols are the same, since sometimes we're mapping symbols to other symbols, and we have a small enough scope it should be safe to check this.

Edit: My remaining concern is that this will leave References to symbols in the wrong symbol table but I will see what happens...

Aug 03 '22 09:08 LonelyCat124

I came across another issue while progressing here. Inside even simple NemoLite2D kernels there are two extra challenges:

    SUBROUTINE bc_flather_u_code(ji, jj, ua, hu, sshn_u, tmask)
      USE physical_params_mod, ONLY: g
      INTEGER, intent(in) :: ji
      INTEGER, intent(in) :: jj
      INTEGER, dimension(:,:), intent(in) :: tmask
      REAL(KIND=go_wp), dimension(:,:), intent(inout) :: ua
      REAL(KIND=go_wp), dimension(:,:), intent(in) :: hu
      REAL(KIND=go_wp), dimension(:,:), intent(in) :: sshn_u
      INTEGER jiu

      IF (tmask(ji,jj) + tmask(ji + 1,jj) <= (-1)) THEN
        RETURN
      END IF
      IF (tmask(ji,jj) < 0) THEN
        jiu = ji + 1
        ua(ji,jj) = ua(jiu,jj) + SQRT(g / hu(ji,jj)) * (sshn_u(ji,jj) - sshn_u(jiu,jj))
      ELSE
        IF (tmask(ji + 1,jj) < 0) THEN
          jiu = ji - 1
          ua(ji,jj) = ua(jiu,jj) + SQRT(g / hu(ji,jj)) * (sshn_u(ji,jj) - sshn_u(jiu,jj))
        END IF
      END IF

    END SUBROUTINE bc_flather_u_code

The first is the local variable declaration, jiu. These can be ignored in general by checking if the symbol is in the OMPTaskDirective's parent symbol table I think (of course if there is a jiu symbol in the local symbol table as well that presents another challenge). Perhaps better is to ignore all non-translated symbols in kernels for the purposes of data-sharing clauses & dependencies, as local variables we don't need to resolve. This needs some significant work though.

The second issue is a more general issue I had not considered. If indices are not used directly (so not ua(ji+1, jj)) but instead indirectly to some variable they are somewhat opaque for working out data dependencies. I assume this issue must come up with other transformations as well (e.g. moving kernels/loops around?). Is there an example of it being resolved elsewhere? @sergisiso @arporter . Alternatively is there something else in the Kern object that describes how the input parameters are used? This could also arise in inlined code so I probably do need something better to do in general :(

Aug 03 '22 10:08 LonelyCat124

There are the API rules for kernels. These state that a kernel must only write to the ji,jj element of any array that is being updated. Any reads of anything other than ji,jj must be declared as a stencil access in the kernel metadata.

Aug 03 '22 11:08 arporter

Ok, I will look at that for kernels.

I hope that avoids a lot of issues for non-inline code and makes life easier.

I'm then just left with the challenge of analysing inline kernels - my only idea is to store every assignment in the task for each lhs variable, and if that variable is used for an array index, compute the values of those assignments if possible. It seems hard work though so I guess the question is code like:

do i = 1, N
  do j = 1, N
     jj = j +1
     ua(i,j) = ub(i, jj) * 2.0
  enddo
enddo

a common use case for PSyclone? My understand is that Nemo will often have code like this?

Aug 03 '22 12:08 LonelyCat124

@arporter I'm still struggling to find how to access the stencil info. I've found the GOStencil class (for GOcean) and some others, but its not clear to me how to extract the relevant metadata. I will leave this for now and come back to it next week, maybe a chat early next week once you're back would be helpful.

Aug 03 '22 14:08 LonelyCat124

Ok, I will look at that for kernels.

I hope that avoids a lot of issues for non-inline code and makes life easier.

I'm then just left with the challenge of analysing inline kernels - my only idea is to store every assignment in the task for each lhs variable, and if that variable is used for an array index, compute the values of those assignments if possible. It seems hard work though so I guess the question is code like:
do i = 1, N
  do j = 1, N
     jj = j +1
     ua(i,j) = ub(i, jj) * 2.0
  enddo
enddo
a common use case for PSyclone? My understand is that Nemo will often have code like this?

Not that often I think. As in, a fair chunk (perhaps the famous 80%) doesn't have code like this.

Aug 08 '22 09:08 arporter

You'll be pleased to hear that Rupert has just completed refactoring the handling of kernel metadata (for GOcean so far - LFRic is now in progress) so life should be easier for you now. I've only today merged it to master. GOceanKernelMetaData contains FieldArg objects that have a stencil property: https://psyclone-ref.readthedocs.io/en/latest/_static/html/classpsyclone_1_1domain_1_1gocean_1_1kernel_1_1psyir_1_1GOceanKernelMetadata_1_1FieldArg.html#a7b8a88abafc5a59cd921a1b97298e224

Aug 08 '22 09:08 arporter

You'll be pleased to hear that Rupert has just completed refactoring the handling of kernel metadata (for GOcean so far - LFRic is now in progress) so life should be easier for you now. I've only today merged it to master. GOceanKernelMetaData contains FieldArg objects that have a stencil property: https://psyclone-ref.readthedocs.io/en/latest/_static/html/classpsyclone_1_1domain_1_1gocean_1_1kernel_1_1psyir_1_1GOceanKernelMetadata_1_1FieldArg.html#a7b8a88abafc5a59cd921a1b97298e224

@arporter Does NEMO have anything similar for this? I've only just come across the Stencil objects (e.g. GOStencil), but its very unclear to me where to get any of this information for a generic CodedKern.

@sergisiso Do you think I should just take the general task directive and make it abstract, and specialize for each of the domains as and when required?

Aug 12 '22 09:08 LonelyCat124

@arporter Does NEMO have anything similar for this? I've only just come across the Stencil objects (e.g. GOStencil), but its very unclear to me where to get any of this information for a generic CodedKern.

Only the PSyKAl APIs currently have this information. Ideally we would attempt to use the dependence analysis to construct it for a CodedKern in NEMO but we don't do that yet - there's no stencil information. (This will be needed if we want to start reasoning about halo exchanges in the NEMO API).

Aug 12 '22 09:08 arporter

Only the PSyKAl APIs currently have this information. Ideally we would attempt to use the dependence analysis to construct it for a CodedKern in NEMO but we don't do that yet - there's no stencil information. (This will be needed if we want to start reasoning about halo exchanges in the NEMO API).

Is there a way to go from the CodedKern object to the actually kernel itself, or are the kernels just treated as black boxes?

Aug 12 '22 09:08 LonelyCat124

Is there a way to go from the CodedKern object to the actually kernel itself, or are the kernels just treated as black boxes?

Yes.. the get_kernel_schedule() method

Do you think I should just take the general task directive and make it abstract, and specialize for each of the domains as and when required?

I am not familiar enough with the implementation to give a definitive answer. If I look at the directive file it currently does not have any import from gocean, but if you end up using GOStencil then yes.

Aug 12 '22 09:08 sergisiso

It also looks like you have a very thin transformation and all the logic is in the node. Could the logic be brought to the transformation. I found it is easier to spcialize transformations for each API unless something must be kept as part of the state.

Aug 12 '22 09:08 sergisiso

Yes.. the get_kernel_schedule() method

Ok, i thought i'd tried this and it not worked but I'll test again!

It also looks like you have a very thin transformation and all the logic is in the node. Could the logic be brought to the transformation.

I don't think so, its all to do with clause generation and as with the previous PR I think it will most likely happen at lower_to_language_level time.

Aug 12 '22 09:08 LonelyCat124

Ok, i thought i'd tried this and it not worked but I'll test again!

You may need the "-d" in the psyclone command to point to the directory where the kernels are (or set up equivalent config parameter if you are in the tests).

I don't think so, its all to do with clause generation and as with the previous PR I think it will most likely happen at lower_to_language_level time.

The problem with doing things when lowering (or gen_code) is that is not composable with other transformations. If it is a transformation it can be applied and then explored/walked in the user script and then transformed again if needed.

Aug 12 '22 09:08 sergisiso

The problem with doing things when lowering (or gen_code) is that is not composable with other transformations. If it is a transformation it can be applied and then explored/walked in the user script and then transformed again if needed.

Yeah, the challenge is if I generate clauses at a time where transformations can happen afterwards, then if someone applied a transformation to the inside of the directive, my clauses might no longer be correct - does that make sense or is there a way to do that from the transformation?

Aug 12 '22 10:08 LonelyCat124

Ah you are right. This is a good reason!

Aug 12 '22 10:08 sergisiso

Aha no, I circled back round and the issue I had with nemo was a kernel that has:

jiu = ji + 1
array[jiu] = ...

as I can't track what jiu is right now, and it doesn't appear as a private variable (because its local to the routine). I need to think about how best to track local variables doh.

Aug 12 '22 12:08 LonelyCat124

as I can't track what jiu is right now, and it doesn't appear as a private variable (because its local to the routine). I need to think about how best to track local variables doh.

You may be best just disabling the transformation for the NEMO API for now?

Aug 12 '22 14:08 arporter

as I can't track what jiu is right now, and it doesn't appear as a private variable (because its local to the routine). I need to think about how best to track local variables doh.

You may be best just disabling the transformation for the NEMO API for now?

Potentially, but Nemo is my only testcase right now so I'd rather fix it up - I think I have a fairly easy solution that should work.

I also spoke with Sergi and there is some indirection in array accesses I need to solve for LFRic at least in the future as well, so I was going to make the base implementation work for NemoLite2D and perhaps disable it for the others for now.

Edit: my solution was close but thrown by a loop due to the ssha_t%data arguments and how symbol tables work. I will keep looking at this.

Aug 16 '22 08:08 LonelyCat124

PSyclone PSyclone copied to clipboard

(Towards #1338) task transformation and directive

Codecov Report

PSyclone
PSyclone copied to clipboard