(Towards #1338) task transformation and directive
A task directive needs to generate dependencies correctly at codegen time.
There are a few principles I've used to determine what these should be:
- **Proxy loops.** When a chunked loop is used around a task directive, the loop variable of the inner loop is considered to be a proxy for the outer loop's variable, and dependencies are created accordingly. For this to apply, the inner loop must have a structure of
  ```fortran
  do i = outer_var, outer_var + outer_step, step
  ```
  (I think? Something like this), and any dependencies are then computed based upon the outer loop variable instead.
- Array dependency accesses to non-proxy loop variables fully inside the task are replaced with `:`, as they are assumed to access any/all array indices.
- A binary-operation access to a proxy loop variable (e.g. `i+1`) will instead generate multiple dependencies, on `i` and `i+outer_step`, to cover the access and give a chance of generating valid dependencies.

The rest is mostly just dependencies as expected, with some restrictions on the types of variable (i.e. no shared array access, no array access inside an array access) to avoid ridiculously complex logic to handle these cases in a general-purpose implementation.
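As a rough illustration, the three rules above could be sketched as follows. This is purely hypothetical: it operates on textual indices rather than PSyIR nodes, and none of these names exist in PSyclone.

```python
def task_dependency_index(index, proxies, steps, private_loop_vars):
    """Hypothetical sketch of the three dependency rules, on textual indices.

    proxies maps an inner (proxy) loop variable to its outer loop variable;
    steps maps an outer loop variable to its step size.
    """
    if index in proxies:
        # Rule 1: a proxy variable depends on the outer loop's variable.
        return [proxies[index]]
    if index in private_loop_vars:
        # Rule 2: a non-proxy loop variable covers any/all indices.
        return [":"]
    if "+" in index:
        # Rule 3: a binary operation like "i+1" on a proxy variable
        # generates dependencies on both i and i+outer_step.
        var = index.split("+")[0].strip()
        if var in proxies:
            outer = proxies[var]
            return [outer, f"{outer}+{steps[outer]}"]
    return [index]

proxies = {"i": "outer"}
steps = {"outer": 32}
print(task_dependency_index("i", proxies, steps, set()))    # ['outer']
print(task_dependency_index("j", proxies, steps, {"j"}))    # [':']
print(task_dependency_index("i+1", proxies, steps, set()))  # ['outer', 'outer+32']
```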
I will copy some comments over as well from the old PR.
This is dependent on #1551 and will appear as all the code from that PR is here for now.
Codecov Report
All modified lines are covered by tests :white_check_mark:
Comparison is base (`48f9f26`) 99.84% compared to head (`f148551`) 99.84%.
Additional details and impacted files
```
@@ Coverage Diff @@
## master #1822 +/- ##
========================================
  Coverage   99.84%   99.84%
========================================
  Files         344      347     +3
  Lines       46013    46916   +903
========================================
+ Hits        45942    46845   +903
  Misses         71       71
```
| Files | Coverage Δ |
|---|---|
| src/psyclone/errors.py | 100.00% <100.00%> (ø) |
| src/psyclone/psyir/nodes/__init__.py | 100.00% <100.00%> (ø) |
| src/psyclone/psyir/nodes/array_mixin.py | 100.00% <100.00%> (ø) |
| ...psyclone/psyir/nodes/dynamic_omp_task_directive.py | 100.00% <100.00%> (ø) |
| src/psyclone/psyir/nodes/loop.py | 100.00% <ø> (ø) |
| src/psyclone/psyir/nodes/omp_directives.py | 100.00% <100.00%> (ø) |
| src/psyclone/psyir/nodes/omp_task_directive.py | 100.00% <100.00%> (ø) |
| src/psyclone/psyir/transformations/__init__.py | 100.00% <100.00%> (ø) |
| ...c/psyclone/psyir/transformations/omp_task_trans.py | 100.00% <100.00%> (ø) |
Ok, looking back at this again today, I found a big oversight in my `handle_binary_op` strategy, which I'm unclear how to fix.
My assumption had been that if I had some code inside my task, e.g.:
```fortran
do j = 1, 320, 32
  !$omp task ....
  do i = j, j+32
    a(i) = b(i+1) - 3
    ...
```
the access to `b` could be handled by a single clause; however, this is not the case. All clauses need to access a binary operation of `j` and a multiple of the outer loop's step size. So in this case, to cover all values of `b` accessed by a single iteration, the correct clause needs to be `depend(in: b(j), b(j + 32))`. Right now I have no way to handle this, and it gets even more complex if there are more nested loops, e.g.
```fortran
do jj = 1, 320, 32
  do ii = 1, 320, 32
    !$omp task ...
    do i = ii, ii+32
      do j = jj, jj+32
        a(i, j) = b(i+1, j+1)
        ...
```
where the required clause is probably `depend(in: b(i,j), b(i+32, j), b(i, j+32), b(i+32, j+32))`.
The current `handle_binary_op` adds the new index (i.e. a `BinaryOperation` or `Reference`) to the `index_list`, which doesn't work when multiple indices are required. Instead, I think the `index_list` needs to be a list of size N (where N is the number of indices), where each element is either:
- a `BinaryOperation`
- a `Reference` (or subclass)
- a `Literal`
- a `list`

The final set of index lists can then be constructed from these. So if we had `[binop1, [binop2, binop3]]`, the clauses would be `(binop1, binop2)` and `(binop1, binop3)`; or, for a worse case, `[[binop1, binop2], [binop3, binop4]]`, the clauses would be `(binop1, binop3)`, `(binop2, binop3)`, `(binop1, binop4)` and `(binop2, binop4)`.
The difficulty (from a design perspective) I have right now is that I don't have a good strategy to construct all the permutations efficiently, but I will continue to think about it.
Edit: `itertools.product` can do this for me.
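For illustration, a minimal sketch of how `itertools.product` enumerates the clause combinations from a nested `index_list` (the `binop*` strings are placeholders standing in for PSyIR nodes):

```python
import itertools

# The "worse case" from above: each position holds a list of alternatives.
index_list = [["binop1", "binop2"], ["binop3", "binop4"]]

# Normalise scalar entries to one-element lists so product() applies
# uniformly, then take the Cartesian product to enumerate every clause.
normalised = [e if isinstance(e, list) else [e] for e in index_list]
clauses = list(itertools.product(*normalised))
print(clauses)
# [('binop1', 'binop3'), ('binop1', 'binop4'),
#  ('binop2', 'binop3'), ('binop2', 'binop4')]
```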
I fixed the previous issue, and have run into another oversight I need to fix. A normal use case might be something like:
```fortran
do i = 1, 320
  do j = 1, 32
    a(i,j) = b(i+1, j) + k
  enddo
enddo
```
or similar. A simple strategy in this case would be to chunk the outer loop and create tasks along it:
```fortran
do i = 1, 320, 32
  !$omp task firstprivate(i), private(ii, j), ....
  do ii = i, i + 32
    do j = 1, 32
      a(ii,j) = b(ii+1, j) + k
    enddo
  enddo
enddo
```
In this instance, `ii` is essentially a placeholder for the relevant value of `i`, so for the purposes of generating the dependencies it needs to be treated as such. I forgot about this when creating a lot of the code generation (oops), so I don't think it is taken into account. I will have to go through the code and find a good way to handle this.
I'm probably going to put design notes for the new ideas I have in this PR now.
Operation order:
1. Find all parent loops inside the same parallel region. It's not entirely clear how to do this efficiently, but it probably just has to involve recursive calls to `ancestor` using either Loop or OMPParallelRegion for the type argument. For each Loop, pull out the symbol, store it for later, and set it to be firstprivate.
2. Find all child loops inside the task region. For each of those loops, check whether they are the inner section of a chunked loop structure, i.e. whether they are initialised as `do var = parent_loop_var, ...`. If so, we can store that Loop's variable as a "proxy" for the relevant `parent_loop_var` (i.e. when encountering this variable, it should be treated as the parent loop's variable for the purposes of dependency analysis). The loop variable itself is declared as private. Notably, this is not totally straightforward if multiple loops use the same loop variable: the "proxy-ing" should be done on a per-loop basis, and only apply to the children of that loop.

The requirement to handle step 2 correctly makes the next steps more complex than the initial implementation.

3. We want to find all children of the node's immediate Loop child (to which the transformation was applied) which are a `Loop`, `Assignment` or `IfBlock`.
4. For `Loop` nodes, we need to work out whether they are a proxy loop, then recurse through their children to find all `Assignment`/`IfBlock`/`Loop` nodes.
5. For `Assignment` nodes, we perform basically the same operation as now, computing all writes to `shared` variables as out dependencies, and all reads of `shared` variables as in dependencies.
6. For `IfBlock` nodes, we perform the same operations as now, computing the relevant in dependencies and private/firstprivate if appropriate.
I'm hopeful that these 6 steps are sufficient to handle the full functionality needed for tasking.
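As a toy sketch of step 2 (proxy detection), using simplified stand-in Loop objects rather than the real PSyIR `Loop` class (all names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Loop:
    variable: str   # loop variable name, e.g. "ii"
    start: str      # start expression as text, e.g. "i" or "1"
    children: list = field(default_factory=list)

def find_proxies(task_loops, parent_loop_vars):
    """Map each inner loop variable to the parent variable it proxies.

    A loop is the inner half of a chunked loop structure if its start
    expression is a parent loop's variable, e.g. `do ii = i, i + 32`.
    In the real implementation this would be per-loop, applying only to
    that loop's children.
    """
    proxies = {}
    for loop in task_loops:
        if loop.start in parent_loop_vars:
            proxies[loop.variable] = loop.start
    return proxies

parents = ["i"]
inner = [Loop("ii", "i"), Loop("j", "1")]
print(find_proxies(inner, parents))  # {'ii': 'i'}
```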
`handle_binary_op` at the moment does the following things, and is called only when a binary operation is found inside an array index:
- If the binary operation is not an `ADD` or `SUB`, throw an exception.
- If the binary operation is not a `LITERAL OP REFERENCE` (or the opposite order), throw an exception. This is possibly too strict (as one could do `LITERAL OP LITERAL` safely), but since those cases are unusual code I'm going to ignore them for now.
- Find which of the children is the `Reference`, and store some info based on that for reconstructing the operation required for the clauses later.
- There are then two options for how to handle this, depending on whether or not the `Reference` is to one of the proxy `parent_loop_var`s.
If the `Reference` is to a `parent_loop_var` proxy, we know that the `Reference` must be private (and thus in `private_list`). In this case, we need to compute the set of clauses required, based upon the `Literal`'s value and the loop step size for the proxy variable.
Note: at the moment the variable has no way to reference the loop that it is the variable of, so we should store that Loop node in the proxy system too.
If the `Reference` is to a loop variable that is private and not a proxy, then we need to create a depend clause on that entire array (i.e. `:`) for that index.
If the `Reference` is to any other private variable, we assume it's just a constant, so we just use a copy.
If the `Reference` variable is in `firstprivate_list`, then it currently checks whether the variable is from a parent loop. If it is, then we use the same code as for when it's a proxy variable, which I think is wrong; we can instead just use a copy of the reference (since the parent loop variable is constant). I think this was done to handle the proxy variables from before.
If it is a child loop's variable, it should throw an exception (child loop variables are not allowed to be firstprivate). In fact, I think this can never happen, so it's mostly a sanity check for now, as it's untestable code I think.
Otherwise, just use a copy of the reference.
Finally, if the variable has not yet been set private or firstprivate, we set it to firstprivate and then perform the same operations as for when it's firstprivate.
If the variable is shared, then we throw an exception for now. Shared variables are in principle supportable, but I don't have a good use-case (since shared variables for now are ArrayReferences, which are banned from being used as indices inside task directives).
I think I will split this function into separate functions to handle private and firstprivate accesses.
Edit 1: A non-split version of this for read-only binary ops is now implemented.
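As a rough model of the clause computation for proxy-variable binary operations described above (this is my own sketch, not the PSyclone code; it treats indices as plain integer offsets from the outer loop variable):

```python
def covering_offsets(literal, step):
    """Offsets (relative to the outer loop variable) whose depend clauses
    cover an access var+literal inside a chunk of size `step`.

    E.g. b(i+1) with step 32 needs clauses on i and i+32, matching the
    depend(in: b(j), b(j + 32)) example in the discussion above.
    """
    if literal % step == 0:
        # The access lands exactly on a chunk boundary: one clause covers it.
        return [literal]
    # Otherwise the accessed range straddles two chunks.
    base = (literal // step) * step
    return [base, base + step]

print(covering_offsets(1, 32))   # [0, 32]  -> depend on i and i+32
print(covering_offsets(33, 32))  # [32, 64] -> depend on i+32 and i+64
print(covering_offsets(-1, 32))  # [-32, 0] -> depend on i-32 and i
```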
`handle_read_only_ref` at the moment does the following things:
- For `(ArrayReference, ArrayOfStructuresReference)`: at the moment, the initial implementation checks whether this ArrayReference was private in the parent parallel region. This can never happen, so it can be removed. We thus know the array is shared, so we loop over the indices. If an Array, AoS or Structure `Reference` appears in the indexing, we throw an exception. If the index is a `Reference`, we handle computing the dependence and index structure accordingly, depending on whether it is a child loop, proxy loop or general variable. If it is a binary operation, then the `handle_binary_op` function is used. Anything else is disallowed (though I think a Literal should also be fine...). This results in a list of (lists of) indices, and `itertools.product` is used to create all the sets of clauses required.
- For `StructureReference`: create a reference to the symbol of the reference (for use later), and perform the same checks done for a general `Reference` on this new reference.
- For `Reference`: if the variable was declared private in the parent parallel region and is not yet in the private_list or firstprivate_list for this clause, add it to the firstprivate_list (as it is read before being written to, so we need some data). If it was not declared private in the parent parallel region, it is a shared variable, so we need to add an in dependency for this `Reference`.
Edit: This is implemented with its subfunctions.
There are some uncovered lines of code in this PR right now, but they are implemented in #1551, so they will be fixed when that is merged and this is rebased onto master.
Ok - I tested this with NemoLite2D and found a variety of issues/bugs to fix.
1. StructureReferences need to add only the outermost member to a clause: it should be `ax` and not `ax%thing`, as the latter is not valid.
2. This directive doesn't work when the kernel is hidden inside a call. I didn't think this would be an issue for Nemo (and would come up later), but it is an issue in NemoLite2D. I therefore need to do something with CodedKern to get this correct.

Issue 1 should be handled for most structure references; however, it is notably not handled correctly when those structures are inside the declared loop's start/stop/step conditions, so I will fix that first.
Ok, a lot of puzzling today: I made a bit of progress and hit some potential hurdles.
One thing I'm realising quickly is that, since kernels use calls to things like `ssha%data`, I might hit issues with a few things in my implementation for `Kern`:
My idea to create a mapping of name to name won't work how I envisioned, since creating a copy of a kernel and then renaming the `ssha` symbol to `ssha%data` is not sensible.
It does not break right now, but it also does not generate any dependencies, so I need to see what's going on tomorrow.
I worked out why nothing was happening. Now the code breaks because, when looking inside a kernel, the checks I did for `Reference`s being in the parent private region no longer work.
This is due to the equality check on `Reference` checking whether the `.symbol`s of both objects are `==`. I think this is generally the correct way to do things (if you have two `Reference`s to different `symbol`s that have the same name, they are not the same; they could be in different modules, etc.).
In this case, though, I should just check that the names and types of the symbols are the same, since sometimes we're mapping symbols to other symbols, and we have a small enough scope that this should be safe to check.
Edit: My remaining concern is that this will leave `Reference`s to symbols in the wrong symbol table, but I will see what happens...
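A minimal sketch of the relaxed comparison (using a hypothetical `Symbol` stand-in, not the real PSyIR class):

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    datatype: str

def symbols_match(a, b):
    # Relaxed check: same name and type is treated as the same variable,
    # even when they are distinct Symbol objects (e.g. one came from a
    # kernel's symbol table via a mapping).
    return a.name == b.name and a.datatype == b.datatype

orig = Symbol("ssha", "real array")
mapped = Symbol("ssha", "real array")
print(orig is mapped, symbols_match(orig, mapped))  # False True
```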
I came across another issue while progressing here. Inside even simple NemoLite2D kernels there are two extra challenges:
```fortran
SUBROUTINE bc_flather_u_code(ji, jj, ua, hu, sshn_u, tmask)
  USE physical_params_mod, ONLY: g
  INTEGER, intent(in) :: ji
  INTEGER, intent(in) :: jj
  INTEGER, dimension(:,:), intent(in) :: tmask
  REAL(KIND=go_wp), dimension(:,:), intent(inout) :: ua
  REAL(KIND=go_wp), dimension(:,:), intent(in) :: hu
  REAL(KIND=go_wp), dimension(:,:), intent(in) :: sshn_u
  INTEGER jiu
  IF (tmask(ji,jj) + tmask(ji + 1,jj) <= (-1)) THEN
    RETURN
  END IF
  IF (tmask(ji,jj) < 0) THEN
    jiu = ji + 1
    ua(ji,jj) = ua(jiu,jj) + SQRT(g / hu(ji,jj)) * (sshn_u(ji,jj) - sshn_u(jiu,jj))
  ELSE
    IF (tmask(ji + 1,jj) < 0) THEN
      jiu = ji - 1
      ua(ji,jj) = ua(jiu,jj) + SQRT(g / hu(ji,jj)) * (sshn_u(ji,jj) - sshn_u(jiu,jj))
    END IF
  END IF
END SUBROUTINE bc_flather_u_code
```
The first is the local variable declaration, `jiu`. These can be ignored in general by checking whether the symbol is in the OMPTaskDirective's parent symbol table, I think (of course, if there is a `jiu` symbol in the local symbol table as well, that presents another challenge). Perhaps better is to ignore all non-translated symbols in kernels for the purposes of data-sharing clauses and dependencies, as we don't need to resolve local variables. This needs some significant work though.
The second issue is a more general one I had not considered. If indices are not used directly (so not `ua(ji+1, jj)`) but instead indirectly through some variable, they are somewhat opaque for working out data dependencies. I assume this issue must come up with other transformations as well (e.g. moving kernels/loops around?). Is there an example of it being resolved elsewhere? @sergisiso @arporter. Alternatively, is there something else in the `Kern` object that describes how the input parameters are used? This could also arise in inlined code, so I probably do need something better to do in general :(
There are the API rules for kernels. These state that a kernel must only write to the `ji,jj` element of any array that is being updated. Any read of anything other than `ji,jj` must be declared as a stencil access in the kernel metadata.
Ok, I will look at that for kernels.
I hope that avoids a lot of issues for non-inline code and makes life easier.
I'm then just left with the challenge of analysing inline kernels. My only idea is to store every assignment in the task for each lhs variable and, if that variable is used as an array index, compute the values of those assignments if possible. It seems hard work though, so I guess the question is: is code like
```fortran
do i = 1, N
  do j = 1, N
    jj = j + 1
    ua(i,j) = ub(i, jj) * 2.0
  enddo
enddo
```
a common use case for PSyclone? My understanding is that Nemo will often have code like this?
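The idea sketched above could look something like the following (hypothetical names, textual expressions standing in for PSyIR nodes): record simple scalar assignments seen in the task body, then substitute them when a scalar is later used as an array index.

```python
# Map from lhs variable name to the last rhs expression assigned to it.
assignments = {}

def record(lhs, rhs):
    """Record an assignment like `jj = j + 1` seen while walking the task."""
    assignments[lhs] = rhs

def resolve_index(name):
    """Substitute a recorded assignment when `name` is used as an index;
    fall back to the name itself if we never saw it assigned."""
    return assignments.get(name, name)

record("jj", "j + 1")
print(resolve_index("jj"))  # j + 1
print(resolve_index("i"))   # i
```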
@arporter I'm still struggling to find how to access the stencil info. I've found the `GOStencil` class (for GOcean) and some others, but it's not clear to me how to extract the relevant metadata. I will leave this for now and come back to it next week; maybe a chat early next week, once you're back, would be helpful.
> I'm then just left with the challenge of analysing inline kernels [...] is code like this a common use case for PSyclone? My understanding is that Nemo will often have code like this?
Not that often I think. As in, a fair chunk (perhaps the famous 80%) doesn't have code like this.
You'll be pleased to hear that Rupert has just completed refactoring the handling of kernel metadata (for GOcean so far - LFRic is now in progress), so life should be easier for you now. I've only today merged it to master. `GOceanKernelMetadata` contains `FieldArg` objects that have a `stencil` property: https://psyclone-ref.readthedocs.io/en/latest/_static/html/classpsyclone_1_1domain_1_1gocean_1_1kernel_1_1psyir_1_1GOceanKernelMetadata_1_1FieldArg.html#a7b8a88abafc5a59cd921a1b97298e224
> You'll be pleased to hear that Rupert has just completed refactoring the handling of kernel metadata [...] GOceanKernelMetadata contains FieldArg objects that have a stencil property.
@arporter Does NEMO have anything similar for this? I've only just come across the Stencil objects (e.g. GOStencil), but it's very unclear to me where to get any of this information for a generic `CodedKern`.
@sergisiso Do you think I should just take the general task directive, make it abstract, and specialise it for each of the domains as and when required?
> @arporter Does NEMO have anything similar for this? I've only just come across the Stencil objects (e.g. GOStencil), but its very unclear to me where to get any of this information for a generic CodedKern.
Only the PSyKAl APIs currently have this information. Ideally we would attempt to use the dependence analysis to construct it for a CodedKern in NEMO but we don't do that yet - there's no stencil information. (This will be needed if we want to start reasoning about halo exchanges in the NEMO API).
> Only the PSyKAl APIs currently have this information. Ideally we would attempt to use the dependence analysis to construct it for a CodedKern in NEMO but we don't do that yet - there's no stencil information.
Is there a way to go from the CodedKern object to the actual kernel itself, or are the kernels just treated as black boxes?
> Is there a way to go from the CodedKern object to the actual kernel itself, or are the kernels just treated as black boxes?
Yes: the `get_kernel_schedule()` method.
> Do you think I should just take the general task directive and make it abstract, and specialize for each of the domains as and when required?
I am not familiar enough with the implementation to give a definitive answer. If I look at the directive file, it currently does not have any import from gocean, but if you end up using GOStencil then yes.
It also looks like you have a very thin transformation and all the logic is in the node. Could the logic be brought into the transformation? I found it is easier to specialise transformations for each API unless something must be kept as part of the state.
> Yes.. the get_kernel_schedule() method
Ok, I thought I'd tried this and it didn't work, but I'll test again!
> It also looks like you have a very thin transformation and all the logic is in the node. Could the logic be brought to the transformation.
I don't think so; it's all to do with clause generation and, as with the previous PR, I think it will most likely happen at `lower_to_language_level` time.
> Ok, i thought i'd tried this and it not worked but I'll test again!
You may need the "-d" in the psyclone command to point to the directory where the kernels are (or set up equivalent config parameter if you are in the tests).
> I don't think so, its all to do with clause generation and as with the previous PR I think it will most likely happen at lower_to_language_level time.
The problem with doing things when lowering (or gen_code) is that it is not composable with other transformations. If it is done by a transformation, it can be applied, then explored/walked in the user script, and then transformed again if needed.
> The problem with doing things when lowering (or gen_code) is that is not composable with other transformations. If it is a transformation it can be applied and then explored/walked in the user script and then transformed again if needed.
Yeah, the challenge is that if I generate clauses at a time where transformations can happen afterwards, then if someone applies a transformation to the inside of the directive, my clauses might no longer be correct - does that make sense, or is there a way to handle that from the transformation?
Ah you are right. This is a good reason!
Aha no, I circled back round, and the issue I had with Nemo was a kernel that has:
```fortran
jiu = ji + 1
array(jiu) = ...
```
as I can't track what `jiu` is right now, and it doesn't appear as a private variable (because it's local to the routine). I need to think about how best to track local variables, doh.
> as I can't track what jiu is right now, and it doesn't appear as a private variable (because its local to the routine). I need to think about how best to track local variables doh.
You may be best just disabling the transformation for the NEMO API for now?
> as I can't track what jiu is right now, and it doesn't appear as a private variable (because its local to the routine).

> You may be best just disabling the transformation for the NEMO API for now?
Potentially, but Nemo is my only testcase right now, so I'd rather fix it up - I think I have a fairly easy solution that should work.
I also spoke with Sergi, and there is some indirection in array accesses I need to solve for LFRic at least in the future as well, so I was going to make the base implementation work for NemoLite2D and perhaps disable it for the others for now.
Edit: my solution was close, but was thrown off by the `ssha_t%data` arguments and how symbol tables work. I will keep looking at this.