PSyclone
PSyclone copied to clipboard
[psyir] Support allocate/deallocate and random_number intrinsics
At this stage, an allocate statement becomes a code block. For the kernel-extraction driver creation, direct support for creating allocate statements would simplify the code (which currently parses a Fortran string to create a codeblock with the allocate statement) is used.
Note that there is also https://github.com/stfc/fparser/issues/298 to support the mold=
argument. It would simplify the code further if mold
would be supported.
This support would also be useful when hoisting local, automatic arrays (#1634). Currently the required allocated()
and allocate()
calls have to be put into CodeBlocks.
I'm thinking of taking this one on as it's needed for #1726. We should also address https://github.com/stfc/fparser/issues/298.
In Fortran we have:
allocate(my_array(x,y,z), [mold=src_array, status=ierr])
and in C we have:
my_array = malloc(x*y*z)
and one checks for success by examining whether my_array
is null
. We could support the 'mold' concept since in C that would simply be a sizeof(src_array)/sizeof(src_array[0]).
Are there other aspects to consider @hiker, @sergisiso and @rupertford?
Naming: Fortran has allocate
and deallocate
, C has malloc
and free
where "malloc" really stands for "memory allocation." I favour having nodes Allocate
and Deallocate
. Although Fortran includes support for specifying the type of the entity (for OO style code), that moves us away from the functionality offered by malloc
so we could simply exclude that?
Just thought: @LonelyCat124 might have an opinion on this too.
Is the plan for these to be a special type of Codeblock node? We should probably prevent code with allocate/deallocate statements from being inside parallel regions unless we have a good reason to allow it? There are certainly algorithms where memory allocation inside parallel regions is a thing, but computing things like shared/private might be not be worth allowing for now.
Allocate
/Deallocate
nodes seems reasonable to me, my only concern may be if a C++ backend (e.g. Kokkos) or something is planned how we handle that, though mostly I think C++ avoids explicit allocation now.
Does anything special need to be done for this to support GPU outputs? Or do you just allocate locally and leave it to OpenMP/OpenCL/...?
Is the plan for these to be a special type of Codeblock node?
No, I was planning on them being full Statement
nodes. We would have to update the various parallel directive classes/transformations to ensure that they are excluded from being inside.
The OO thing is something we've stayed away from so far so I think we're really talking about something that grabs a certain number of bytes, with support for specifying array shape. I haven't thought about Kokkos but maybe @sergisiso can comment?
I hadn't thought about the implications for GPU. I guess the very fact that we recognise allocates would mean that we could then replace them with allocates directly on the GPU. Certainly it would provide more information for the work @nmnobre is doing on explicit data movement.
Some ideas:
-
Do we need them as specific nodes? Maybe being just new operators like the other Fortran intrinsics would be enough for them not being in a codeblock. Is that enough for the psyad use case?
-
If we need to rationalize about memory usage (e.g. for GPU) I think we need to add an attribute to datatype to mark them as "heap allocated" because the important thing is the symbol info - not the specific statement. (akind to the Fortran allocatable attribute). But maybe both are separate things, we can implement one without the other. If/when we have both we can typecheck that allocations happen with allocatable types.
-
I agree using the Fortran names (as we use for most concepts). I would not worry about other backends and favor making things simple for Fortran. For other backends we could deal with the syntax change in their backend or simply don't support it (if we only convert to the other language the kernels and let the driver still be Fortran it is sensible to ask for no allocations in kernels so if one is found we request the user to hoist it outside the kernel)
That sounds good Sergi. There's a slight issue with the stat argument to allocate, e.g.:
allocate(array1(2,2), array2(3,2), stat=ierr)
if (ierr /= 0)then
...
Presumably in this case we'd have something like:
alloc1 = NaryOperation.create(
NaryOperation.Operator.ALLOCATE,
Reference(array1_sym), [two, two])
Assignment.create(lhs=Reference(ierr1), rhs=alloc1)
alloc2 = NaryOperation.create(
NaryOperation.Operator.ALLOCATE,
Reference(array2_sym), [three, two])
Assignment.create(lhs=Reference(ierr2), rhs=alloc2)
Assignment.create(Reference(ierr), SUM(ierr1, ierr2))
We could have status
as an optional argument to the create
method or potentially have it as the result of the operation (I prefer the latter)? There's also the mold
option that @hiker is keen to have.
Since mold
is just another way to specify the shape (and potentially type) of the thing being allocated it can be expressed in terms of other operators:
arank = RANK(array1)
for i in range(arank):
a_shape.append(size(array1, i))
NaryOperation.create(NaryOperation.Operator.ALLOCATE,
Reference(barray), a_shape)
However, that's an internal thing. For our purposes we can support a shape specified as a list of DataNodes
or via an ArrayReference
.
If some existing Fortran has an allocate
without a stat
argument then we have nothing to assign the result of this proposed operation to. We can't have a bare Operation
as a child of Schedule
as it's not a Statement
. This is the same problem we have with trying to represent the Fortran random_number
intrinsic as an operation. We could represent such operations as
assigning to the operand:
var = ALLOCATE(size, status)
and
var = RANDOM()
although in the latter case we then have an Operation with no operand unless we do:
var = RANDOM(var)
which is a bit clunky and will cause problems if someone does:
other_var = RANDOM(var)
since in Fortran that would really be call random_number(other_var)
with no dependence on var
.
This problem is also an issue for deallocate(my_var)
.
Following discussion on Teams, we have agreed to move away from "having all intrinsics as operators." To that end we will introduce a new IntrinsicCall
node.
We have an implementation choice:
- Have just one
IntrinsicCall
class and allow the type of the intrinsic to be specified when it is constructed (c.f. operators); - Have a sub-class of
IntrinsicCall
for each type of intrinsic, e.g.Allocate
,Deallocate
etc. Option 1. is quite concise but makes it difficult to validate the supplied arguments. It is also difficult to support named arguments. Option 2. could wind up with a lot of extra classes but allows us to validate arguments easily and support named arguments. The classes that I know we need currently would be:Allocate
,Deallocate
,Random
. Probably there are others?
Hi @sergisiso and @rupertford, would you mind taking a look at the commit (https://github.com/stfc/PSyclone/commit/5e04f94f894c19267b116072deabc570df418722) and see what you think? (I've implemented Option 1. but have managed to add a fair amount of argument validation as well.)
I like the general direction. I also prefer option 1 rather than 2.
The validation of the supplied arguments that you do inside create is something that at some point we will also want the base class to do when we have the Call and the RoutineSymbol declaration/signature. Since we don't do it yet it is fine to be here for now but keep in mind (TODO?) that we could generalize this. Similarly, I am unsure if the _required_args and _optional_args should be inside the Intrinsic node. Since this is just the Call and they represent the signature, to me they are different things, maybe they belong to the IntrinsicSymbol. But again, since we don't have proper generic signatures in RoutineSymbols it may be a good solution for now.
We need to decide whether an allocate call should follow the Fortran syntax and allow multiple entities to be declared at once. This makes supporting the specification of a single status variable easy. Perhaps status checking should be incorporated in the node (when it is lowered)?