PSyclone Avoid the need for redundant computation transformations with annexed dofs

The LFRic annexed dofs switch can reduce the required number of halo exchanges significantly and is therefore a useful option. However, it can lead to uninitialised data being read. Whilst this uninitialised data is only ever read in cases where the result is not used for valid field values i.e. it does not affect the correctness of the computation, it can lead to ieee errors or similar.

The cases are when a new field is initialised when iterating over dofs. The two builtins that do this are setval_c and setval_x. In this case the owned and annexed dofs are initialised but the rest of the l1 halo is not. If this field is then subsequently accessed in a kernel with a gh_inc access and the kernel iterates over cell columns, then the field's owned, annexed and l1 halo values will be read. This case does not cause a halo exchange as the annexed dofs switch allows us to assume that the owned and annexed dofs are valid on entry.

The current solution is to apply redundant computation for both setval_c and setval_x. This is OK for setval_c as it simply writes to the field so does not cause any additional halo exchanges. However setval_x could require an additional halo exchange as the field that it reads may have dirty halos. In practice the Met Office have found that applying redundant computation to setval_x does result in extra halo exchanges and that it not actually required in their current implementation.

The current solution is not a good one as the annexed dofs switch is not safe without redundant computation transformations and the application of redundant computation transformations may result in more halo exchanges than necessary.

The proposed solution is to add additional run-time logic to a field that allows PSyclone to set a new halo exchange state indicating that a halo is uninitialised after a setval_c or setval_x builtin. Then, before any kernel that writes to a continuous field, all fields that are read must have code that checks this flag and then initialises the halos to appropriate values. This logic is obviously very similar to the existing halo exchange logic and may be able to be incorporated into it.

The values to use to initialise the data are interesting. In theory we do not care as the computed values will not be used. However, if we simply set the values to a constant value then we still may have ieee errors (e.g. we could get divide by zero or underflow). The safe solution is to do a halo exchange at this point. A solution that avoids a halo exchange is to copy existing valid data from internal values into the halo. There is still a risk here as e.g. there may be masked data, or data may differ significantly over the domain, but in practice this could be a good optimisation option to have.

Mar 17 '23 09:03 rupertford

@christophermaynard has asked for an explanation of the problem so here we go ...

If we start with the following example:

call invoke( &
    kern1(f1),  &
    kern2(f1))

where f1 is on a continuous function space and the access to f1 in kern1 and kern2 is gh_inc, the resultant halo exchanges when the annexed_dofs switch is false in pseudo-code are:

if dirty: hex(f1)
loop:
    kern1(f1)
loop:
    kern2(f1)

Notice there is no halo exchange added between kern1 and kern2. This is because a gh_inc access only requires a field's owned and annexed values to be clean. PSyclone knows that this is the case due to the previous access to f1 in kern1 so does not add a halo exchange. I refer to this as Tom's optimisation as he first explained it to me when we were in Australia in a pub. However, PSyclone does not know this for the access to f1 in kern1 as it does not know what has happened before this invoke, i.e. f1 could have dirty annexed dof values.

When the annexed_dofs switch is true and applied across all invokes in the codebase (hence the configuration switch) we know that annexed dofs are always clean. This means that no halo exchange for f1 is needed before kern1:

loop:
    kern1(f1)
loop:
    kern2(f1)

This is one of the optimisations that annexed_dofs allows, reducing the number of halo exchanges added to the code and resulting in a significant reduction in the number of active halo exchanges in LFRic.

I will assume you know why annexed_dofs being true allows us to assume that annexed dofs are always clean.

Let's now add a setval_c. Note that this could be in an earlier invoke and PSyclone would not know about it, but in this case we will place it in the same invoke as kern1 and kern2:

call invoke( &
    setval_c(f1), &
    kern1(f1),  &
    kern2(f1))

With annexed_dofs set to true, we know that f1's annexed dofs will always be clean. It is obvious here as setval_c will write to owned and annexed dofs and we know that kern1 and kern2 both result in clean owned and annexed dofs. Therefore no halo exchanges are required.

However, if f1 is a new field then only its owned and annexed dofs will have been written to by setval_c, but in the subsequent gh_inc access to f1 in kern1, its owned and L1 halo (including annexed dofs) will be read then written. This results in a read of data that has not been touched.

So that is the problem. We don't want to add a halo exchange before each gh_inc access as that removes a large benefit of the annexed_dofs option, but we also do not want to read uninitialised data.

The issue is new fields where the kernel/builtin only writes to owned and annexed dofs. This is only an issue for kernels (builtins) that operate on dofs and the only kernels/builtins that write to a field are setval_c and setval_x. This is why making both of these builtins compute redundantly to the l1 halo using the redundant computation transformation fixes the problem. The problem with doing this for setval_x is that it introduces a possible halo exchange before the builtin which has been found to increase the number of halo exchanges in LFRic quite a bit, all in cases where they are not required.

Mar 29 '23 20:03 rupertford

The proposed change could allow us to get rid of a some calls to setval_c from the code as they would be replaced with auto-generated initialisation of just the l1 halo and annexed dofs.

Currently, if a new field has a built-in applied (e.g. X_divideby_Y(new_field, x_val, y_val) ) then the halos of the new field would be uninitialised causing a potential failure should new_field be passed as a GH_INC argument. A setval_c is required prior to the built-in call. With the proposed change, auto-generated code could set just the halo values after the builtin call and before the halo call.

Scientists would have to identify which setval_c calls are no longer required. But if they remove ones that are still needed, the numerical checking tests should pick them up.

With regard to the options for initialising the halo, I suggest that the kernel metadata can be updated to permit a kernel to define an initial value for the likely unusual cases where defaulting to 0.0 may lead to a divide-by-zero or other numerical error.

Apr 02 '24 14:04 stevemullerworth