
FFSL transport in LFRic depends on a kernel that loops over halo cells

Open tommbendall opened this issue 1 year ago • 23 comments

One crucial part of the FFSL transport scheme in LFRic involves a kernel which loops over only the cells in the halos. This is not currently supported by PSyclone, so it is done through a psykal_lite implementation: see the invoke_ffsl_panel_swap_kernel_type here: https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/gungho/source/psy/psykal_lite_transport_mod.F90?rev=755#L786

To summarise what needs to happen here: we need to perform a halo exchange to the full halo depth, and loop over all halo cells (and not owned cells)

This loop over the halo cells doesn't really fit the current design of LFRic kernels, but to avoid a psykal_lite implementation I think we need new metadata. Two initial suggestions I have thought of are:

  • having a GH_HALOWRITE access permission
  • introducing a CELL_HALOCOLUMN value for the operates_on variable

Mea culpa: I should've opened this issue several months ago when the code was added to trunk. I overlooked it and recently remembered, so sorry about that!

tommbendall avatar Jul 15 '24 15:07 tommbendall

I thought I'd just add another comment here, as there is a whole category of kernels that currently have psykal lite implementations because they loop to some depth in the halo regions, before setting the output field to be clean (to ensure that halo swaps don't happen subsequently on these fields).

It might be helpful to tackle this other problem together, with whatever solution is best for this issue

tommbendall avatar Jul 19 '24 14:07 tommbendall

Thanks @tommbendall. Those kernels that "loop to some depth in the halo regions" - is that depth set at compile-time or run-time? To summarise, we have:

  • Kernels that write only to halo cells (to a certain depth?)
  • Kernels that write to owned and halo cells (to a certain depth)

It feels like that information naturally belongs to operates_on as that's what that describes. The access permissions relate to how dofs in a given cell are accessed so I don't think that works in this context.

arporter avatar Jul 24 '24 12:07 arporter

Thanks Andy, I agree with that summary and that it makes most sense to add this through the operates_on variable. I suppose that the invoke would also require a depth argument.

tommbendall avatar Jul 25 '24 07:07 tommbendall

Currently, the docs for operates_on say: [screenshot of the current operates_on documentation]. We need to extend this to support the two further cases given above. Really, CELL_COLUMN is effectively OWNED_CELL_COLUMN, and then we would naturally have HALO_CELL_COLUMN and OWNED_PLUS_HALO_CELL_COLUMN. For backwards compatibility we can make OWNED_CELL_COLUMN a synonym for the existing CELL_COLUMN.
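
To make that concrete, a minimal sketch of what the metadata for a halo-only kernel might then look like (the kernel name, argument list and function space are hypothetical, purely for illustration):

  type, public, extends(kernel_type) :: a_new_kern_type
    private
    type(arg_type) :: meta_args(2) = (/                &
         arg_type(GH_FIELD, GH_REAL, GH_WRITE, W3),    &
         arg_type(GH_FIELD, GH_REAL, GH_READ,  W3)     &
         /)
    ! The only change from an ordinary kernel is the iteration space:
    integer :: operates_on = HALO_CELL_COLUMN
  contains
    procedure, nopass :: a_new_kern_code
  end type a_new_kern_type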

As Tom says, both of the new values of operates_on will require a halo depth to be supplied. If this is only known at run-time then it will need to be passed to the kernel in the invoke, e.g.:

call invoke(a_new_kern(halo_depth, field1, field2))

I suggest that we pass this depth to the kernel subroutine immediately following the nlayers argument as it is not associated with any particular kernel argument.
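
For illustration, the kernel subroutine's argument list might then start like this (a sketch only, assuming the two W3 fields from the metadata sketch above; i_def and r_def are the usual LFRic integer and real kinds):

  subroutine a_new_kern_code(nlayers, halo_depth, field1, field2, &
                             ndf_w3, undf_w3, map_w3)
    ! halo_depth sits immediately after nlayers since it is not tied
    ! to any particular kernel argument.
    integer(kind=i_def), intent(in) :: nlayers, halo_depth
    integer(kind=i_def), intent(in) :: ndf_w3, undf_w3
    integer(kind=i_def), dimension(ndf_w3), intent(in)  :: map_w3
    real(kind=r_def), dimension(undf_w3), intent(inout) :: field1
    real(kind=r_def), dimension(undf_w3), intent(in)    :: field2
    ! ... kernel body loops over layers and dofs here ...
  end subroutine a_new_kern_code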

However, if the depth for a particular kernel is always going to be fixed then we could think about extending the metadata to capture this value.

How does that sound @tommbendall and @TeranIvy?

arporter avatar Aug 08 '24 10:08 arporter

Thanks Andy, that sounds really good to me.

Trying to think ahead, the depth to which one would loop into the haloes would probably be:

  • user-defined, e.g. to a depth of 1
  • to the full depth of the halo
  • to the minimum clean depth of the field arguments that are provided

With what you're suggesting, I think that could still all be covered by the user getting that information in the algorithm layer of the code

tommbendall avatar Aug 09 '24 14:08 tommbendall

Thanks Tom. Yes, your last two options could be captured in meta-data but that would require more engineering. However, if they are going to be the bulk of the use cases then it's probably worth implementing this as otherwise that's a lot of duplication in various algorithm layer code.

arporter avatar Aug 28 '24 12:08 arporter

Right now, I think there are only a handful of these kernels but they are pretty much evenly split between the above cases!

Because there aren't many kernels, I'm not worried about algorithm-layer duplication, but I'd be more worried about the amount of engineering that you have to do to cover a couple of kernels.

So to me, it seems the best option right now is to always require that the user passes in the halo depth to which they want to loop. This still allows all situations, since the user can still decide in the algorithm layer whether this is 1, the full depth or the clean depth. That also hopefully simplifies the PSyclone side of things...

tommbendall avatar Aug 29 '24 07:08 tommbendall

Right, so now I have: [two screenshots of the updated documentation]. Does that look OK @tommbendall?

arporter avatar Sep 12 '24 17:09 arporter

Hi Andy, yes that looks great, thanks.

One final thought I have is about the description of owned_cell_column. If a kernel is performing an INC operation on a continuous field, we currently (correctly) loop into the halos to depth 1. I think we still want to specify this situation with the cell_column metadata -- I don't know if we should update the description for owned_cell_column to reflect that nuance.

Maybe this is overcomplicating things so please feel free to ignore, but the description could say: "Single column of cells from the owned region. If performing an INC operation on continuous fields this will loop to depth one in the halo, but otherwise the columns are exclusively from the owned region."

tommbendall avatar Sep 13 '24 08:09 tommbendall

Finally getting back to this @tommbendall. I've updated the doc as per your suggestion. Hopefully I'm correct in thinking that a kernel that operates on HALO_CELL_COLUMN or OWNED_AND_HALO_CELL_COLUMN can't have any stencil accesses?

arporter avatar Sep 20 '24 11:09 arporter

Thanks Andy. Sorry to be a pain but actually I think there is a kernel which would loop into the haloes and also use a stencil: https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/gungho/source/kernel/transport/common/remap_on_extended_mesh_kernel_mod.F90

which currently has this psykal_lite implementation: https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/gungho/source/psy/psykal_lite_transport_mod.F90#L212

Is it possible to support both stencils and looping into the halos? I think the main restriction would be that stencil_depth + halo_depth_to_iterate_over can't exceed the full halo depth.
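
To sketch that restriction (names are assumed here; get_halo_depth is taken to return the mesh's full halo depth and the error-reporting call is purely illustrative):

  ! The stencil must not reach beyond the allocated halo when we
  ! iterate halo_depth cells into it.
  if (stencil_depth + halo_depth > mesh%get_halo_depth()) then
    call log_event("stencil depth + halo iteration depth exceeds mesh halo depth", LOG_LEVEL_ERROR)
  end if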

tommbendall avatar Sep 20 '24 11:09 tommbendall

Aha! So much for my thinking :-)

arporter avatar Sep 20 '24 12:09 arporter

TODO:

  • [ ] Ensure that 'special' kernels (inter-grid, CMA) reject an OPERATES_ON that goes into the halos?
  • [x] Update loop bound generation for these new kernels (see below);
  • [x] Ensure the halo-exchange placement is updated to allow for these kernels;
  • [x] Extend the Algorithm Invoke parsing to handle the new halo-depth argument;
  • [ ] Ensure the halo-depth argument is passed to the PSy-layer routine;

The psykal_lite that Tom points to above contains:

cell_start = mesh%get_last_edge_cell() + 1
cell_end   = mesh%get_last_halo_cell(halo_compute_depth)

for the loop bounds over the halo cells. halo_compute_depth is passed into this PSy routine by argument.
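
Putting that together, the PSy layer for a HALO_CELL_COLUMN kernel would presumably need to generate a loop along these lines (a sketch using the hypothetical names from the invoke example above):

  ! Iterate over halo cells only: from the first cell beyond the owned
  ! region out to the requested halo depth.
  do cell = mesh%get_last_edge_cell() + 1, mesh%get_last_halo_cell(halo_depth)
    call a_new_kern_code(nlayers, halo_depth, field1_data, field2_data, &
                         ndf_w3, undf_w3, map_w3(:,cell))
  end do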

arporter avatar Sep 20 '24 13:09 arporter

Another question for @tommbendall : what should we do if distributed-memory support is switched off? Can we 'optimise out' any calls to kernels that have operates_on=HALO_CELL_COLUMN? Should everything else 'just work' (i.e., the call to mesh%get_last_halo_cell(...) will do the right thing)?

EDIT: actually, by default, PSyclone won't bother to get a mesh object in this case so perhaps we just call any OWNED_AND_HALO_CELL_COLUMN kernels with a halo depth of zero?

arporter avatar Sep 23 '24 10:09 arporter

Thanks for the question! I think OWNED_AND_HALO_CELL_COLUMN kernels should be executed with a halo depth of zero (so that they become equivalent to OWNED_CELL_COLUMN kernels). As you say, HALO_CELL_COLUMN kernels should then not be executed at all. I think even those might just work though...

tommbendall avatar Sep 23 '24 16:09 tommbendall

I'm starting to get there now but am painfully aware that this is probably going to clash horribly with what @sergisiso is doing in moving LFRic to use the PSyIR backend. Is it worth me merging your branch into mine do you think?

arporter avatar Sep 24 '24 14:09 arporter

As discussed yesterday, feel free to go ahead and implement this on master. Although the number of failing tests with the new backend is going down, it will still take some months until everything is cleaned up and ready to merge.

sergisiso avatar Sep 25 '24 08:09 sergisiso

I had a brainwave and thought that perhaps I could use the existing Dynamo0p3RedundantComputationTrans to implement the necessary functionality. However, that only accepts an integer for the depth of the halo to compute into. Unfortunately, the loop-bounds and halo-exchange logic is not coded very well: an integer depth is handled differently from a variable specifying the depth (as is permitted for a kernel with a stencil). Also, in the latter case, the halo is only ever read, so that only affects the halo-exchange calls, not the loop bounds.

arporter avatar Sep 30 '24 11:09 arporter

I was going to allow halo kernels to be optimised out of the PSy layer if distributed memory is disabled. However, that causes a lot of issues as we can then have an empty invoke and we simply aren't set up to handle this situation. Since it's an edge case, it doesn't feel like it's worth putting a lot of effort into it at present.

arporter avatar Sep 30 '24 14:09 arporter

@tommbendall just a heads-up that argument_mod.F90 in LFRic core will need to have the new iteration spaces added to it. In the copy in our test infrastructure I have:

  !> @defgroup operates_on Enumeration of kernel iterator property descriptors.
  !> @{
  integer, public, parameter :: CELL_COLUMN                = 396
  integer, public, parameter :: OWNED_CELL_COLUMN          = 396
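  ! OWNED_CELL_COLUMN deliberately shares the value of CELL_COLUMN so
  ! that the two names are synonyms (for backwards compatibility).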
  integer, public, parameter :: HALO_CELL_COLUMN           = 397
  integer, public, parameter :: OWNED_AND_HALO_CELL_COLUMN = 398
  integer, public, parameter :: DOMAIN                     = 945
  integer, public, parameter :: DOF                        = 712

arporter avatar Sep 30 '24 15:09 arporter

Another point for consideration by @TeranIvy and @tommbendall : my 'bright' idea of changing "cell_column" to be "owned_cell_column" has more consequences than I thought. If I change PSyclone so that any kernel that specifies "cell_column" is actually stored as "owned_cell_column" then of course, any optimisation script that looks for kernels that operate on "cell_column" will now have to be updated. Is that OK or should I drop this idea of renaming "cell_column" entirely?

arporter avatar Oct 01 '24 08:10 arporter

Also, a question for @tommbendall: we need to think about what state the halo regions will be in after these new kernel types have been called. Are the rules the same as for standard 'owned_cell_column' kernels? i.e. if the kernel updates a field on a continuous space then the outermost halo cells that are written to will still be 'dirty' because they won't have correct answers?

arporter avatar Oct 01 '24 08:10 arporter

Thanks for all this Andy.

I've made a branch and ticket to pick up the changes to argument_mod.F90: https://code.metoffice.gov.uk/trac/lfric/ticket/4452

I could only spot one part of the optimisation scripts that specifies "cell_column" and I have updated it to "owned_cell_column" as part of the lfric-core ticket, so from my point of view that's OK, but @TeranIvy might see something that I haven't!

In terms of the halo regions, I think the halo regions that have been written to must be set as clean, but the outer-most halo regions should be marked as dirty.
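
For reference, in the generated PSy layer that bookkeeping is done through the field proxy's set_dirty/set_clean calls, so something along these lines would presumably follow one of these kernels (a sketch; the exact depth passed to set_clean is still to be decided):

  ! Mark the whole halo of the updated field as dirty, then record the
  ! depths that this kernel has actually left clean.
  call field1_proxy%set_dirty()
  call field1_proxy%set_clean(halo_depth - 1)  ! outermost written depth stays dirty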

tommbendall avatar Oct 01 '24 10:10 tommbendall

> I thought I'd just add another comment here, as there is a whole category of kernels that currently have psykal lite implementations because they loop to some depth in the halo regions, before setting the output field to be clean (to ensure that halo swaps don't happen subsequently on these fields).
>
> It might be helpful to tackle this other problem together, with whatever solution is best for this issue

Ah, so this is why we have both only-halo and owned-and-halo cell kernels? If this is a performance optimisation (as you perhaps indicate) then it should be done by using the 'redundant computation' transformation, not by modifying kernel metadata. Is that a solution @tommbendall ?

arporter avatar Nov 07 '24 13:11 arporter

I've included this in my long answer here: https://github.com/stfc/PSyclone/pull/2745#discussion_r1832863306

In summary, I think redundant computation and looping through OWNED_AND_HALO_CELL_COLUMNs will result in the same PSy-layer code, and so there may not be a need for the OWNED_AND_HALO_CELL_COLUMN metadata. However, for some kernels it might be more natural to specify the OWNED_AND_HALO_CELL_COLUMN metadata rather than using the redundant-computation transformation, which might feel like a hack.

It's probably also worth mentioning that I don't think we currently have a redundant computation transformation that could do this.

tommbendall avatar Nov 07 '24 15:11 tommbendall