PSyclone Add support for controlling data movement in OpenMP offload

We support both structured and unstructured data regions in OpenACC. It would probably be useful to have the equivalent in OpenMP offload.

Jun 28 '24 16:06 arporter

In order to use 'Unified Memory', the OpenMP standard mandates that a source file must contain !$omp requires unified_shared_memory. It turns out that the only compiler that doesn't require this is NVIDIA's and of course, that's the only one we've worked with in any detail so far. PSyclone therefore also needs some way of adding this directive to every source file it processes. It seems a bit clunky to do this as a Transformation but the alternative is then a command-line switch. What do @sergisiso, @hiker and @AidanChalk think?

Jul 01 '24 15:07 arporter

the OpenMP standard mandates that a source file must contain !$omp requires unified_shared_memory

I don't think this is exactly right. My understanding is that it acts like an assert that checks that the feature is supported by the compiler/platform: https://www.openmp.org/spec-html/5.0/openmpse12.html

So it is nice to have (for safety: it will fail at compile time instead of at runtime), but it is not mandatory.

It turns out that the only compiler that doesn't require this is NVIDIA's

As commented above, this is not my understanding.

PSyclone therefore also needs some way of adding this directive to every source file it processes.

The problem with this is that we would also need it in any file that PSyclone does not process but that touches that data, so enabling USM with the compiler flag is still necessary. Just to be clear, I think we should add it as a nice early-failure feature, but in practice it won't change how we do things in NEMO.
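To see which files in a build would still lack the directive, a check along these lines could be scripted. This is only a sketch in Python: the function names are hypothetical (they are not part of PSyclone), and it assumes free-form Fortran with the directive written on a single line.

```python
import re
from typing import Dict, List

# Assumption: free-form Fortran only. Fixed-form sentinels (c$omp, *$omp)
# and continuation lines are deliberately not handled in this sketch.
_REQUIRES_USM = re.compile(
    r"^[ \t]*!\$omp[ \t]+requires\b[^\n]*\bunified_shared_memory\b",
    re.IGNORECASE | re.MULTILINE,
)


def has_requires_usm(source: str) -> bool:
    """Return True if the source text contains the requires-USM directive."""
    return _REQUIRES_USM.search(source) is not None


def files_missing_requires_usm(sources: Dict[str, str]) -> List[str]:
    """Return the sorted names of the files that lack the directive.

    'sources' maps a (hypothetical) file name to its Fortran source text.
    """
    return sorted(name for name, text in sources.items()
                  if not has_requires_usm(text))
```

Running it over both the PSyclone-processed files and the untouched ones would list the files where the early-failure check is absent.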

Jul 01 '24 16:07 sergisiso

As I mentioned in mattermost, for Intel GPUs to work currently, the relevant "allocate" statements have to be replaced with "omp_target_alloc_shared".

Jul 01 '24 16:07 sergisiso

@sergisiso - it's more than an assert in that if the compiler does not support that directive, and you've written your program assuming USM, then you have a non-conforming OpenMP program and the behaviour is not specified - i.e., you cannot reason about what your program will do. See OpenMP 5.2 page 28 lines 17-19. So in that sense it is mandatory, and a compiler that rejects the directive is safely saying that it can't compile your code correctly.

The NVIDIA compiler has a flag that replaces ALLOCATE calls with calls to cudaMallocManaged, so it works "for free" with or without the pragma as long as you include the special compiler flag to do that. If you forget the flag and have also left out the pragma, then you're into incorrect code again. This is implementation-defined behaviour.

My reading of OpenMP 5.2 is that Fortran allocatable arrays are implicitly mapped tofrom (OpenMP 5.2, page 149, lines 4-5). There are some restrictions about assumed-size arrays though, but I'm too de-caffeinated to parse them right now. Note that implicitly moving data like this is different from USM. Mapping has host and device copies of the data, with the latter reference counted (so a map(to:) might not always copy to the device unless you use map(always, to:)). With USM, the model is that there is one and only one copy of the data (technically an implementation could cache it, just like any shared data can be cached, but this is where the memory coherency model kicks in).
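The reference-counting behaviour can be pictured with a toy model. The Python sketch below is not an OpenMP runtime: ToyDevice and its methods are made-up names, and it assumes the simplified rule that map(to:) only copies when the reference count was zero on entry, unless the always modifier is given.

```python
class ToyDevice:
    """Toy model of reference-counted 'map' behaviour (illustration only)."""

    def __init__(self):
        self.data = {}       # variable name -> device copy
        self.refcount = {}   # variable name -> reference count
        self.copies_to = 0   # number of host-to-device transfers performed

    def map_to(self, name, host_value, always=False):
        """Model region entry with map(to: name) or map(always, to: name)."""
        is_new = self.refcount.get(name, 0) == 0
        self.refcount[name] = self.refcount.get(name, 0) + 1
        if is_new or always:
            # Copy host -> device only for a new mapping or with 'always'.
            self.data[name] = list(host_value)
            self.copies_to += 1

    def unmap(self, name):
        """Model region exit: the device copy is released at count zero."""
        self.refcount[name] -= 1
        if self.refcount[name] == 0:
            del self.data[name]
            del self.refcount[name]


dev = ToyDevice()
host = [1, 2, 3]
dev.map_to("a", host)               # new mapping: data is copied
host[0] = 99                        # host copy changes afterwards
dev.map_to("a", host)               # already present: no copy is made
stale = dev.data["a"][0]            # device still holds the old value
dev.map_to("a", host, always=True)  # 'always' forces the transfer
fresh = dev.data["a"][0]            # device now matches the host
```

Under USM there would be no separate device copy to go stale in the first place, which is the distinction being drawn above.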

omp_target_alloc_shared is an Intel extension, and not standard OpenMP (it's not in the latest TR 12 - the 6.0 preview). It is unclear if there is a Fortran equivalent, as the only documentation for it I can see is for C/C++. Intel's documentation also notes they use the requires unified_shared_memory directive for portability, which I guess means that, where the API doesn't exist, you can provide your own implementation of this vendor API that forwards to a regular malloc, which is valid if USM is active.

Jul 01 '24 17:07 tomdeakin

I agree with all you said @tomdeakin and I agree that we should try to add it to generate better output. I was just saying that this would not change what we can currently do with Intel PVC, because the compiler will just say that USM is not supported, albeit with a proper compiler message rather than UB.

Jul 01 '24 17:07 sergisiso

PSyclone therefore also needs some way of adding this directive to every source file it processes. It seems a bit clunky to do this as a Transformation but the alternative is then a command-line switch.

It could also be done by the lowering of OMPTargetDirective (which can navigate up to its parent nodes). This could depend on an (Enum?) attribute on this node, such as "MemoryMode.USM".

The problem with being a PSyIR Directive is that it needs to be in an executable region, and therefore the best we could do is to put it at the top of every subroutine. But instead we could use the less conventional option of making it an UnknownFortranType PSyIR Symbol with the string (!$omp requires unified_shared_memory). With this we could put it at the Container or even File level and use a symbol table lookup to prevent duplicating it.
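A rough illustration of the "look it up before adding it" idea, written as standalone Python. ToyContainer, add_once, lower_target_directive and the tag name are hypothetical stand-ins and not the PSyIR API; the dictionary simply plays the role of the symbol table.

```python
REQUIRES_USM = "!$omp requires unified_shared_memory"


class ToyContainer:
    """Hypothetical stand-in for a Container- or File-level scope."""

    def __init__(self, name):
        self.name = name
        # tag -> declaration text; plays the role of the symbol table
        self.symbols = {}

    def add_once(self, tag, text):
        """Return the entry for 'tag', creating it only if it is absent."""
        return self.symbols.setdefault(tag, text)


def lower_target_directive(container):
    """Toy 'lowering' step for one target region.

    Every target region asks its enclosing scope for the directive; the
    lookup by tag means only the first request actually adds it.
    """
    return container.add_once("omp_requires_usm", REQUIRES_USM)


container = ToyContainer("my_module")
for _ in range(3):                  # three target regions in one module
    lower_target_directive(container)
```

However many target regions are lowered, the scope ends up holding a single copy of the directive text.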

The reason for making the attribute an Enum is that we could also have MemoryMode.IMPLICIT_MAPPING: when OMPTargetTrans inserts the directive, we could validate that its body contains only things that have implicit mapping rules (e.g. the "if loop.walk(StructureReference):" check that I proposed in mattermost, plus any other Fortran data accessor/symbol interface without these rules defined) and raise TransformationError if not.

Finally, we could have a MemoryMode.EXPLICIT_MAPPING, which won't do any of the aforementioned checks but would leave the responsibility for adding map directives/clauses to other transformations applied by the user, or added automatically at lowering time.
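As a sketch of how the three modes might drive validation, here is a standalone Python version. The MemoryMode members are the ones named above; ToyTransformationError, NO_IMPLICIT_RULES and validate_region are hypothetical, and node types are represented by plain strings rather than PSyIR classes.

```python
from enum import Enum, auto


class MemoryMode(Enum):
    USM = auto()               # rely on unified shared memory
    IMPLICIT_MAPPING = auto()  # rely on OpenMP implicit mapping rules
    EXPLICIT_MAPPING = auto()  # map clauses are added by someone else


class ToyTransformationError(Exception):
    """Hypothetical stand-in for a transformation validation error."""


# Assumption for illustration: kinds of node in the region body that have
# no implicit mapping rule defined.
NO_IMPLICIT_RULES = frozenset({"StructureReference"})


def validate_region(node_kinds, mode):
    """Reject a region body that the chosen memory mode cannot handle.

    'node_kinds' is an iterable of strings naming the kinds of node found
    in the region body (a stand-in for walking the tree).
    """
    if mode is MemoryMode.IMPLICIT_MAPPING:
        offending = sorted(set(node_kinds) & NO_IMPLICIT_RULES)
        if offending:
            raise ToyTransformationError(
                "No implicit mapping rules for: " + ", ".join(offending))
    # USM and EXPLICIT_MAPPING perform no checks in this sketch.
```

With this shape, only IMPLICIT_MAPPING can reject a region; the other two modes accept any body and defer to the hardware or to later transformations.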

Jul 03 '24 07:07 sergisiso

The problem with being a PSyIR Directive is that it needs to be in an executable region, and therefore the best we could do is to put it at the top of every subroutine

We could also add it as a specific directive and allow FileContainer to have a UnifiedMemoryDirective as child 0 or something.

Sep 17 '24 09:09 LonelyCat124