PSyclone (Closes #2716) Add support for module-inlining calls to polymorphic kernels/routines

Oct 03 '24 09:10 arporter

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.91%. Comparing base (e8f0021) to head (00c36fd). Report is 131 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #2732   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files         364      364           
  Lines       51697    51781   +84     
=======================================
+ Hits        51652    51739   +87     
+ Misses         45       42    -3

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Oct 03 '24 13:10 codecov[bot]

As well as addressing the coverage, I need to extend the LFRic integration tests to use this new functionality.

Oct 03 '24 14:10 arporter

Added the KernelModuleInlineTrans back into the LFRic transformation script and get a crash when trying to add the interface symbol into the symbol table.

Oct 07 '24 16:10 arporter

Build and run with updated transformation script works: This doesn't seem very different from the profile of the version before this change so I need to check exactly what PSyclone is making of things.

Oct 08 '24 14:10 arporter

If you merge with master, I added multiple prints in the output, so you can grep and count the number of each cases that we could not get the kernel_schedule before due to polymorphic kernels, which was 331

Oct 08 '24 14:10 sergisiso

The OMP offload LFRic integration test failed with an ICE:

17:47:36 Pre-process and compile inventory/id_r32_field_array_pair_mod.F90
inventory/id_r32_field_array_pair_mod.F90:
NVFORTRAN-S-0000-Internal compiler error. flowgraph: node is zero       3  (lfric_xios_setup_mod_psy.f90: 85)
NVFORTRAN-F-0000-Internal compiler error. Invalid key for hash       0  (lfric_xios_setup_mod_psy.f90: 85)
NVFORTRAN/x86-64 Linux 24.5-1: compilation aborted

It's seems odd that there's a _psy.f90 for xios setup??

EDIT: L85 of that file corresponds to the end of the (module-inlined?) nodal_coordinates_code kernel (subroutine) which is called from within an offloaded region. However, the routine itself does not contain an offload directive?

Oct 08 '24 20:10 arporter

Do the build manually for OpenACC (which doesn't give an ICE) and things are looking better: (note: I've marked MATMUL as being available on the GPU now.) I think @sergisiso has seen that module-inlining a routine is sufficient for the compiler to do the right thing, even if it's not had e.g. acc routine added to it?

Oct 08 '24 21:10 arporter

I need to figure out why we get an inlined routine without an offload directive added to it. The build log says:

PSy name = 'lfric_xios_setup_mod_psy'
Transforming invoke 'invoke_0_nodal_xyz_coordinates_kernel_type' ...
Failed to annotate 'nodal_xyz_coordinates_code' with GPU-enabled directive due to:
Transformation Error: Kernel 'nodal_xyz_coordinates_code' accesses the symbol 'chi2xyz: RoutineSymbol<NoType, pure=unknown, elemental=unknown>' which is imported. If this symbol represents data then it must first be converted to a Kernel argument using the KernelImportsToArguments transformation.
Failed to annotate 'nodal_xyz_coordinates_code' with GPU-enabled directive due to:
Transformation Error: Kernel 'nodal_xyz_coordinates_code' accesses the symbol 'chi2xyz: RoutineSymbol<NoType, pure=unknown, elemental=unknown>' which is imported. If this symbol represents data then it must first be converted to a Kernel argument using the KernelImportsToArguments transformation.
Transforming invoke 'invoke_1_nodal_coordinates_kernel_type' ...
Failed to module-inline kernel 'nodal_coordinates_code' due to:
Transformation Error: Cannot inline subroutine 'nodal_coordinates_code' because another, different, subroutine with the same name already exists and versioning of module-inlined subroutines is not implemented yet.
Successfully offloaded loop with ['nodal_coordinates_code']
Successfully offloaded loop with ['nodal_coordinates_code']
Transforming invoke 'invoke_2_nodal_coordinates_kernel_type' ...
Failed to module-inline kernel 'nodal_coordinates_code' due to:
Transformation Error: Cannot inline subroutine 'nodal_coordinates_code' because another, different, subroutine with the same name already exists and versioning of module-inlined subroutines is not implemented yet.
Successfully offloaded loop with ['nodal_coordinates_code']
Successfully offloaded loop with ['nodal_coordinates_code']

Oct 08 '24 21:10 arporter

Part of the problem was that the optimisation script wasn't checking that it hadn't already transformed a given kernel for a given invoke. Fixing that removes the 'failed to inline' messages but we still end up with an inlined kernel that doesn't have a directive added to it.

Oct 09 '24 08:10 arporter

The problem was that, having successfully module-inlined the kernel routine, we proceed to apply the annotation transformation to the original Kern and that does not update the Routine that has been inlined. It probably should? For now, I can search for the newly inlined routine and apply the transformation to that and we get the expected code.

Oct 09 '24 08:10 arporter

Now get a compilation failure:

NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure matrix_vector_code (
algorithm/norm_alg_mod_psy.f90: 338)

This seems to be because we have successfully inlined this interface and its routines but subsequent PSy-layer subroutines are still importing it from a module. In turn, this is because I now check whether or not we've already transformed a kernel of a given name. Essentially, we need multiple LFRicKern objects to all point to the same bit of PSyIR.

Oct 09 '24 08:10 arporter

Kern.get_kernel_schedule() (which is to be replaced/removed if/when we migrate Kern to subclass Call) currently caches the PSyIR of the kernel. KernelModuleInlineTrans creates a copy of this PSyIR and then inserts it into the PSy-layer. Therefore, get_kernel_schedule() should now return that copy of the PSyIR. In fact, that would happen automatically if I undid my changes to KernelModuleInlineTrans so that it doesn't copy the PSyIR. I can't remember why I made that change.

I think I made that change because, without copying, a routine gets removed from its original Container but that breaks any Interface that refers to it. This becomes a problem in the (unlikely) event that a routine is called directly as well as via an interface.

Oct 09 '24 09:10 arporter

If we have an Algorithm layer that contains two invokes that each call the same kernel then we attempt to apply KernelModuleInlineTrans to each kernel call. If the first one succeeds then we immediately (in global.py) proceed to add e.g. acc routine to it. That done, we move on to the next invoke where we try to do the same thing. We then find that the body of the routine we want to inline does not match the one we've already inlined because we've added acc routine. This results in:

The second kernel is not flagged as being module inlined and thus is handled by _rename_and_write.
The result of _rename_and_write is a renamed kernel for which the PSy layer does not contain a symbol.

This shouldn't happen. However, we also want to mark all kernels of the same name as being module-inlined and pointing to a single implementation. Since global.py works invoke by invoke, it's not simple (or natural) to alter it so that it works kernel by kernel.

Oct 10 '24 09:10 arporter

Thinking about this a bit more, I belatedly realise that if a given PSy layer routine calls the same kernel more than once then either all of them must be module inlined or none of them (unless we attempt to rename the inlined version and that gets complicated). I think the reason that LFRic is still not working is that we fail to inline a second instance of the same kernel (because the first instance has been transformed) but then we do proceed to transform it. Therefore, rename_and_write() is called in order to generate a new version of that kernel on disk. I need a reproducer for this.

Oct 10 '24 13:10 arporter

I've realised that Kern.module_inline sets this flag for all Kernels with that name in the current InvokeSchedule. This seems sensible but is causing me problems because we end up with a Kernel marked as 'module inlined' but which actually isn't because the transformation on it failed and so it was written to file and renamed instead.

The solution might therefore be as simple as not attempting to transform a Kernel that is already marked as module inlined.

Oct 11 '24 09:10 arporter

The trouble with the LFRic example is that we have two, separate invoke calls, each with the same Kernel. The first Kernel gets module inlined but has to be modified to have some imports made local to it. That then means we refuse to inline the second Kernel because it is no longer identical. The question is, why does the search for the source of the second kernel not return the module-inlined source of the first one?

Oct 11 '24 11:10 arporter

With this branch and latest gpu script in lfric/examples:

Nov 18 '24 14:11 arporter

Adding RANDOM_NUMBER intrinsic to available_on_device - build still worked OK on GPU but profile didn't look much different. Need to update the Sankey diagram and see whether I've missed anything.

Nov 18 '24 15:11 arporter

One of the compilation tests fail with the nvidia compiler. This same test passes with gfortran:

Nov 19 '24 10:11 arporter

$ nvfortran -I../dynamo_wrapper0/utilities -I../dynamo_wrapper0/field -I../dynamo_wrapper0/operator -I../dynamo_wrapper0/function_space -I../dynamo_wrapper0/mesh -I../dynamo_wrapper0/scalar -I../dynamo_wrapper0/configuration -c psy.f90
NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mixed_code (psy.f90: 289)

Nov 19 '24 11:11 arporter

The problem is that the invoke subroutine is still importing the kernel interface, even though it has been module inlined:

    SUBROUTINE invoke_1(scalar_r_phys, a, istp)
      USE mixed_kernel_mod, ONLY: mixed_code

EDIT: I think this is correct. We've only module-inlined the mixed_kernel for invoke_0, not invoke_1. The latter still uses the imported version of the kernel. Is this compliant with the Fortran standard though?

Nov 19 '24 13:11 arporter

I'm beginning to think that this is a compiler bug. If I alter the second invoke so that it calls a specific implementation of the kernel (e.g. mixed_code_64) then it compiles fine. I've made a reproducer and sent it to Lukas and Wayne.

Nov 19 '24 16:11 arporter

This is ready for a first review. It's quite big but can wait until 3.0 is out of the door. Probably needs to be reviewed by @sergisiso.

Nov 22 '24 10:11 arporter

In experimenting with this branch, I discovered a bug: the routines that a PUBLIC Interface maps to may themselves be declared PRIVATE in their containing module! Unfortunately, in what I've written I've assumed they must be public.

Nov 28 '24 10:11 arporter

Trying this updated branch with LFRic (and my enhanced global.py) I now get:

Failed to module-inline routine coordinate_jacobian:
Transformation Error: routine 'coordinate_jacobian' contains accesses to 'jacobian_abr2xyz' which is declared in the same module scope. Cannot inline such a routine.

where jacobian_abr2xyz is actually a routine in the same module so we could inline it...

Also get

File '../lfric/components/science/source/kernel/geometry/chi_transform_mod.F90' does contain module 'chi_transform_mod' but PSyclone is unable to create the PSyIR of it.

This is possibly because that module contains an unsupported form of interface definitions:

  INTERFACE
    SUBROUTINE chi2xyz_interface(chi_1, chi_2, chi_3, panel_id, x, y, z)
      IMPORT :: r_def, i_def
      IMPLICIT NONE
      INTEGER(KIND = i_def), INTENT(IN) :: panel_id
      REAL(KIND = r_def), INTENT(IN) :: chi_1, chi_2, chi_3
      REAL(KIND = r_def), INTENT(OUT) :: x, y, z
    END SUBROUTINE chi2xyz_interface
  END INTERFACE

Nov 28 '24 17:11 arporter

Attempt to do recursive inlining:

Failed to inline call jacobian_abr2xyz(alpha, beta, radius, panel_id):
Transformation Error: Routine 'jacobian_abr2XYZ_r_single' cannot be inlined because it accesses variable 'r_single' from its parent container.

but r_single is actually named in an import into the parent container => should be easy to handle.

...well, ish. It's a bit messy because there's no clean way to spot all the Symbols a bit of code uses and I don't want to duplicate what validate() does. Maybe we can use VAI?

Nov 28 '24 17:11 arporter

Got a crash while processing init_saturated_profile_alg_mod.x90:

  File "/home//Projects/PSyclone/src/psyclone/domain/lfric/lfric_loop.py", line 895, in gen_code
    super().gen_code(parent)
    ~~~~~~~~~~~~~~~~^^^^^^^^
  File "/home//Projects/PSyclone/src/psyclone/psyir/nodes/loop.py", line 581, in gen_code
    kernel.rename_and_write()
    ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home//Projects/PSyclone/src/psyclone/psyGen.py", line 1664, in rename_and_write
    self._rename_psyir(new_suffix)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home//Projects/PSyclone/src/psyclone/psyGen.py", line 1741, in _rename_psyir
    kern_schedule.name = new_kern_name[:]
    ^^^^^^^^^^^^^^^^^^
  File "/home//Projects/PSyclone/src/psyclone/psyir/nodes/routine.py", line 419, in name
    sym = self.symbol_table.lookup(symbol.name)

because the Routine instance doesn't yet have a symbol table.

Apr 10 '25 13:04 arporter

Fixed that and now get failure when doing:

LFRIC_OFFLOAD_DIRECTIVES=omp psyclone -api lfric -d /Projects/LFRic/spack_gfortran/lfric_apps_7536/applications/gungho_model/working/build_gungho_model -s applications/gungho_model/optimisation/psyclone-test/global.py /Projects/LFRic/spack_gfortran/lfric_apps_7536/applications/gungho_model/working/build_gungho_model/algorithm/initialisation/init_saturated_profile_alg_mod.x90

because the kernel schedule obtained in CodedKern._rename_and_write() has somehow been detached from its parent container.

Apr 10 '25 14:04 arporter

I suspect the problem is that we have module-inlined the kernel and broken something along the way.

Module-inlined kernel 'initial_theta_code'
Failed to annotate 'initial_theta_code' with GPU-enabled directive due to:
Transformation Error: Kernel 'initial_theta_code' accesses the symbol 'analytic_temperature: RoutineSymbol<UnresolvedType, pure=unknown, elemental=unknown>' which is imported. If this symbol represents data then it must first be converted to a Kernel argument using the KernelImportsToArguments transformation.

Apr 10 '25 20:04 arporter

Further investigation shows that it is when generating the code for invoke_5 that we get the failure. When processing this invoke I see:

Transforming invoke 'invoke_5' ...
setting _parent None
setting _parent None
setting _parent None
setting _parent Container[init_saturated_profile_alg_mod_psy]

setting _parent None
setting _parent Container[init_saturated_profile_alg_mod_psy]

Failed to module-inline kernel 'theta_e_code' due to:
Transformation Error: Kernel 'theta_e_code' cannot be module inlined into Container 'init_saturated_profile_alg_mod_psy' because a *different* routine with that name already exists and versioning of module-inlined subroutines is not implemented yet.
Annotated kernel 'theta_e_code'

Apr 11 '25 11:04 arporter