PSyclone [PSyData] Tracing writes to variables

While debugging various LFRic crashes, we found ourselves often in the position that we know a variable has incorrect values (not necessarily NAN, but e.g.0, or just garbage), and we need to find out if and where this variable was written. Note that this cannot be done with simple NAN testing (since it's not necessarily NANs/infinity).

We also discussed in the past (can't find a ticket for it atm) an potential improvement of NAN testing by allowing to specify a range of valid values for a variable in an invoke, e.g. using specific environment variable: PSYDATA_INVOKE_ABC_VAR_XYZ=0:100 to indicate that the variable xyz in the invoke abc should be between 0 and 100. According to feedback from my team, this would not help with the issue at hand, since a) they don't know the function where the variable is written incorrectly and b) they wouldn't even know the expected range.

A suggested solution that would help with some restrictions:

we keep track of the address of the first element in each field at runtime. This would be a unique value, even if the field is passed and renamed across several layers of subroutine calls (esp. using ptr somewhere in between). (*)
we use the dependency analysis to log the address of each variable written, and variable read for a kernel - together with region information and variable name (which PSyData has).
As proof-of-concept: we can just dump this to a file/stderr.

If then during debugging we find a field with unexpected values, we look in this log file for the last usage of this variable and the invoke name. We get the address of this variable, and look back in the log file for when this variable was written before - and we get the region name and variable name. So we automatically know where to check next.

Potential problems/restrictions:

if a variable is a temporary variable, the memory could be reused (e.g. in subroutine A we declare a field, initialise it (which gets logged), and free the memory. Then in an unrelated subroutine B we declare new field, which accidentally gets assigned to the same memory. This variable is not initialised but used. When we look up using the address, we would get the initialisation of the previous temporary variable. Code inspection would quickly diagnose what is happening (since the previous write is to a local field only).
- This could be properly fixed by an improvement of the field type in the infrastructure library: using a class variable each field could be given a unique number!
It would very likely not work for scalar numbers (not sure if we can have scalar numbers as output or a kernel??). While Fortran should pass by reference, chances are that we have several layers of subroutines with local variables.
- I would accept this as a known restrictions, since typically the problems are in fields.

Long term vision: we improve the PSyDATA API to allow 'cross triggers': e.g. NAN testing could trigger other PSyData code. So if we have a NAN, we could then trigger tracing psydata (which would automatically find the previous write of this field). Or we trigger kernel extraction to create a test case. That's really long term.

I am trying a proof-of-concept of just printing the addresses of fields, and we will evaluate if this is useful in our debugging process.

Jul 19 '22 02:07 hiker

I've implemented a prototype that just prints out accesses and addresses. Assuming that we fix builtins, the following example:

    call invoke( name = 'Initialise fields',        &
                 setval_c( field1,     0.0_r_def ), &
                 setval_c( field2,     1.0_r_def )  &
                 )
    call invoke( name = 'testkern_w0', testkern_w0_kernel_type(field1, field2) )

produces:

 Tracing write:field1       94772378048832  module main_psy region invoke_initialise_fields:setval_c:r0
 Tracing write:field2       94772378053152  module main_psy region invoke_initialise_fields:setval_c:r1
 Tracing read :field1       94772378048832  module main_psy region invoke_testkern_w0:testkern_w0_code:r2
 Tracing read :field2       94772378053152  module main_psy region invoke_testkern_w0:testkern_w0_code:r2
 Tracing read :map_w0       94772378044508  module main_psy region invoke_testkern_w0:testkern_w0_code:r2
 Tracing write:field1       94772378048832  module main_psy region invoke_testkern_w0:testkern_w0_code:r2

If for example we know that field1 (uniquely identified by the address, since the name might depend on the naming the user has used in a subroutine) in invoke_testkern is invalid, this trace immediately shows us that this variable was written to in invoke_initialise.

An unexpected disadvantage: LFRic implicit variables (like map_w0) show up as read only, without initialisation :( There are a few ways to handle this:

We declare this a known issue, since most implicit variable names should be well known (I am lazy :) )
We don't instrument implicit variables (which would need an option to the access collection code to ignore the implicit variables. We would need this likely anyway if we want to use the new access info method to determine halo exchanges)
We use a special version of field_mod.f90, which would set the field as mapped when a field is created. While this is a bit of a pain, it's not the first time we would like to have an extension of field_mod (the other was tracing if fields have been written to).

Jul 19 '22 07:07 hiker

Example of gungho:

 Tracing write:u       94075760869376  module semi_implicit_timestep_alg_mod_psy region invoke_8:setval_x:r17
 Tracing read :u       94075760869376  module conservation_algorithm_mod_psy region invoke_compute_conserved_quantities:compute_total_energy_code:r2
 Tracing read :u       94075760869376  module conservation_algorithm_mod_psy region invoke_compute_conserved_quantities:compute_total_aam_code:r4
 Tracing read :field3       94075760869376  module checksum_alg_mod_psy region invoke_2:x_innerproduct_x:r2

You can see that there are different names (u and field3) for the same address. After some searching this is confirmed to be correct:

 ./driver/gungho_model_mod.F90:      call checksum_alg(program_name, rho, 'rho', theta, 'theta', u, 'u', &
...
  SUBROUTINE checksum_alg(model_name, field1, name1, field2, name2, field3, name3, field4, name4, field_bundle, bundle_name)
...
    IF (PRESENT(field3)) CALL invoke_2(chksum3, field3)
...
    SUBROUTINE invoke_2(chksum3, field3)
...
      DO df=loop0_start,loop0_stop
        l_chksum3(1,th_idx) = l_chksum3(1,th_idx) + field3_proxy%data(df)*field3_proxy%data(df)
      END DO

Jul 19 '22 10:07 hiker

PSyclone PSyclone copied to clipboard

[PSyData] Tracing writes to variables

PSyclone
PSyclone copied to clipboard