PSyclone [nemo] add support for ACCEnterDataTrans

Currently ACCEnterDataTrans is only supported for the GOcean API. We will need it for the NEMO API if we are to generate efficient OpenACC code.

Mar 01 '19 16:03 arporter

I've realised after talking with Rupert that this isn't going to be as easy as I first thought so I'm doing some manual experiments with enter/exit data. So far I have:

program data_test
  implicit none
  logical :: psy_data_region = .False.
  real, allocatable :: my_array(:,:)

  allocate(my_array(10,10))
  my_array(:,:)= 0.0

  call sub1()
  call sub1()

  call sub2()
  call sub2()

  write(*,*) my_array(1,1)

contains

  subroutine sub1
    implicit none
    !$acc enter data if(.not. psy_data_region) copyin(my_array)
    psy_data_region = .TRUE.
    !$acc kernels
    my_array(:,:) = my_array(:,:) + 1.0
    !$acc end kernels

  end subroutine sub1
  subroutine sub2
    implicit none
    !$acc exit data if(psy_data_region) copyout(my_array)
    psy_data_region = .FALSE.
    my_array(:,:) = my_array(:,:) + 1.0
  end subroutine sub2
end program data_test

I had to add the psy_data_region var as otherwise I get errors if the reference count for my_array goes below 0. The problem with this is that I have to know that my_array is on the GPU in order to copy it off so that it can be accessed on the CPU. Perhaps I can use acc_is_present(my_array) for this?

Dec 16 '21 17:12 arporter

This is going pretty well but I've realised that I don't currently handle Call nodes correctly. Currently we seem to treat all of the arguments to a call as being read on the host (which, provided they are not expressions, they are not) and we don't recognise that they could have been written to on the GPU inside the call itself. Hence, we have to treat the passing of any variable (other than a loop variable) to a call as a potential write on the GPU.

Dec 23 '21 12:12 arporter

The initial, simple solution to this is to always request that any data that is read on the CPU is pulled back from the GPU immediately beforehand. Similarly, any data that is written on the CPU is immediately pushed back to the GPU. With this strategy, the multi-kernel version of the tracer-advection benchmark compiles, runs and validates against the managed-memory version.
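A minimal hand-written sketch of this eager strategy is given below. It is not output generated by PSyclone: the program name, the array `field` and the scalar `total` are hypothetical. Every CPU read is preceded by an `update host` and every CPU write is followed by an `update device`. Built without OpenACC, the directives are just comments and the program computes the same values.

```fortran
program eager_updates
  ! Hypothetical example: names are illustrative, not taken from NEMO or PSyclone.
  implicit none
  real, allocatable :: field(:)
  real :: total

  allocate(field(4))
  field(:) = 0.0

  ! Place the data on the device once, at the start.
  !$acc enter data copyin(field)

  ! First kernel: runs on the GPU and updates the device copy only.
  !$acc kernels present(field)
  field(:) = field(:) + 1.0
  !$acc end kernels

  ! CPU read: pull the data back immediately beforehand.
  !$acc update host(field)
  total = sum(field)

  ! CPU write: push the new value to the device immediately afterwards.
  field(1) = 10.0
  !$acc update device(field)

  ! Second kernel: sees the value written on the CPU.
  !$acc kernels present(field)
  field(:) = field(:) * 2.0
  !$acc end kernels

  ! Final CPU read, again preceded by an update.
  !$acc update host(field)
  write(*,*) total, field(1), field(2)

  !$acc exit data delete(field)
end program eager_updates
```

The three values written should be 4.0, 20.0 and 2.0 (the exact formatting depends on the compiler). The cost of this approach is that whole arrays are transferred around every CPU access, even when only one element has changed.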

Dec 23 '21 15:12 arporter

Note that while I thought I could do !$acc update if(acc_is_present(my_var)) host(my_var), I got segmentation faults. However, it turns out that the update directive has the if_present clause, which gives me precisely the functionality I wanted.
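As a rough illustration of the if_present form, the sketch below calls the same routine once while the array is host-only and once after it has been placed on the device. The program and the routine `read_on_host` are hypothetical names chosen for this example. With `if_present`, the update is skipped when the data is not on the device, so the routine does not need to know where the array currently lives.

```fortran
program update_if_present
  ! Hypothetical example: names are illustrative only.
  implicit none
  real, allocatable :: my_var(:)

  allocate(my_var(8))
  my_var(:) = 1.0

  ! Case 1: my_var has never been copied to the device,
  ! so the update inside read_on_host does nothing.
  call read_on_host(my_var)

  ! Case 2: my_var is on the device and has been modified there,
  ! so the update inside read_on_host refreshes the host copy.
  !$acc enter data copyin(my_var)
  !$acc kernels present(my_var)
  my_var(:) = my_var(:) + 1.0
  !$acc end kernels
  call read_on_host(my_var)

  !$acc exit data delete(my_var)

contains

  subroutine read_on_host(arr)
    real, intent(inout) :: arr(:)
    ! Only performs the transfer if arr is present on the device.
    !$acc update if_present host(arr)
    write(*,*) arr(1)
  end subroutine read_on_host

end program update_if_present
```

The first call should print 1.0 and the second 2.0, whether or not the code is built with OpenACC enabled.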

Dec 23 '21 15:12 arporter

@nmnobre this is the issue where I've been working on the explicit data transfers. The associated branch is 310_enter_data.

Jan 20 '22 16:01 arporter

It adds a new script (https://github.com/stfc/PSyclone/blob/310_enter_data/examples/nemo/scripts/kernels_explicit_data_mv_trans.py) that is the equivalent of kernels_trans.py but adds the data-movement directives too.

Jan 20 '22 16:01 arporter

Testing this with PSycloneBench/#80 reveals that handling local, automatic arrays with enter data doesn't work. Probably we should have create instead?
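One possible shape for the create idea is sketched below. This is an assumption about how it could look, not what the branch currently generates; the routine `smooth` and its automatic array `work` are hypothetical. Because `work` holds no meaningful data on entry and ceases to exist on return, it is only allocated on the device with `create` and released with `delete` before the routine exits, with no host transfers at all.

```fortran
program local_array_create
  ! Hypothetical example: names are illustrative only.
  implicit none
  real :: field(16)

  field(:) = 1.0
  !$acc enter data copyin(field)

  call smooth(field, size(field))

  !$acc update host(field)
  !$acc exit data delete(field)
  write(*,*) field(1)

contains

  subroutine smooth(f, n)
    integer, intent(in) :: n
    real, intent(inout) :: f(n)
    real :: work(n)  ! automatic array: exists only for the duration of this call

    ! Allocate device storage for the scratch array without copying anything.
    !$acc enter data create(work)

    !$acc kernels present(f, work)
    work(:) = 2.0 * f(:)
    f(:) = work(:) + 1.0
    !$acc end kernels

    ! Release the device storage before the host array goes out of scope.
    !$acc exit data delete(work)
  end subroutine smooth

end program local_array_create
```

The value written should be 3.0. A structured `!$acc data create(work)` region around the kernels would be an alternative, since the lifetime of the scratch array matches the routine body.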

Mar 01 '22 11:03 arporter

From my perspective, these are the main outstanding issues:

- Arrays of pointers, including allocatables, will need deep copies so the pointers are correctly attached. Fortunately, this has been supported since OpenACC 2.6 which means there's full support in the Nvidia compilers. Michael Wolfe has written a bit on the topic, see:
  - Manual deep copies. Unfortunately, the implicit detach seems to have a bug... this was introduced somewhere between versions 21.1 and 21.7 of the Nvidia HPC SDK. I've filed this issue here;
  - Full deep copies via -gpu=deepcopy. We can use this flag to circumvent the bug above but... Unfortunately, this solution is also not without problems. Indeed, whenever the derived type includes character arrays (in every case if larger than 16 bytes and, in some cases, even for smaller arrays - I suspect this happens when the compiler decides to have a pointer to said character array).
- I wonder what happens when the target device is the multicore CPU running the host thread. Does the compiler add unnecessary copies from and to the same location (since memory is effectively shared)?
- WRITE statements are rightly ignored for character literals, but we could safely do the same with all constant literals: arithmetic, logical and character;
- Data updates after CPU writes are currently a bit too eager; we could try to delay them until just before a kernels construct or the end of the procedure.
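For reference, the manual deep-copy pattern mentioned above can be sketched as follows. This is a minimal example under the assumption of an OpenACC 2.6 (or later) compiler; the derived type `field_type` and the variable `fld` are hypothetical. The parent object is copied first and its allocatable member second, so that the member's device pointer is attached inside the device copy of the parent. On exit the order is reversed.

```fortran
program manual_deep_copy
  ! Hypothetical example: type and variable names are illustrative only.
  implicit none

  type :: field_type
    integer :: n
    real, allocatable :: values(:)
  end type field_type

  type(field_type) :: fld

  fld%n = 8
  allocate(fld%values(fld%n))
  fld%values(:) = 1.0

  ! Parent first, then the member: the second copyin attaches the
  ! device address of fld%values within the device copy of fld.
  !$acc enter data copyin(fld)
  !$acc enter data copyin(fld%values)

  !$acc kernels present(fld)
  fld%values(:) = fld%values(:) + 1.0
  !$acc end kernels

  ! Member first (copied back and detached), then the parent.
  !$acc exit data copyout(fld%values)
  !$acc exit data delete(fld)

  write(*,*) fld%values(1)
end program manual_deep_copy
```

The value written should be 2.0. The detach that accompanies the `exit data` on the member is the step affected by the compiler bug described in the list above.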

Mar 11 '22 09:03 nmnobre

> Manual deep copies. Unfortunately, the implicit detach seems to have a bug... this was introduced somewhere between versions 21.1 and 21.7 of the Nvidia HPC SDK. I've filed this issue here;

This issue has been fixed in version 22.5 of the Nvidia HPC SDK.

Jun 09 '22 08:06 nmnobre