
New forall & reducer API


Summary

This PR adds support for a parameterizable RAJA::forall interface that takes optional "plugin-like" objects to extend the execution behavior of a RAJA::forall context.

  • This PR includes support for Reduce and KernelName ForallParam Types.
  • Tests and example exercises are included.
  • Additional CI / namespace changes are listed at the bottom.

RAJA::Forall Interface & ForallParam Objects

Each operational parameter passed to RAJA::forall will be referred to as a "ForallParam" object, as each one inherits from RAJA::expt::ForallParamBase.

  • All work for the new API lives adjacent to the current RAJA::forall implementation under the RAJA::expt::... namespace.

  • The experimental RAJA::forall implementation is called implicitly when a user passes a RAJA::expt::<ForallParam> object as an argument. If none are provided, the current RAJA::forall implementation is used.

    Reduce

    • RAJA::expt::Reduce takes a target variable to write the final result to (rs, rm).
    • It passes a corresponding argument to the RAJA lambda to be used as the local instance of the target (_rs, _rm).
    • The local variable is initialized with the "identity" of the reduction operation to be performed (e.g. 0 for a sum, the largest representable value for a minimum).
    • A reduction is performed implicitly by the RAJA::forall across thread copies of the local variable.
    • Finally, the reduction operation is applied to the original value of the target and the result of the reduction.
    • The final value can be retrieved simply by referencing the target variable.
    double rs;
    double rm;
    
    RAJA::forall<EXEC_POL> ( Res, Seg, 
    RAJA::expt::Reduce<RAJA::operators::plus>(&rs),
    RAJA::expt::Reduce<RAJA::operators::minimum>(&rm),
    [=] (int i, double& _rs, double& _rm) {
      _rs += ...
      _rm = RAJA_MIN(..., _rm); 
    }
    );
    
    std::cout << rs ...
    std::cout << rm ...
    

    ValLoc Reductions

    • ValLoc reductions require the use of RAJA::expt::ValLoc<T> types. Since these are strongly typed, they provide min() and max() operations. Users must use getVal() and getLoc() to retrieve results.
    using VL_INT = RAJA::expt::ValLoc<int>;
    VL_INT rm_loc;
    
    RAJA::forall<EXEC_POL> ( Res, Seg, 
    RAJA::expt::Reduce<RAJA::operators::minimum>(&rm_loc),
    [=] (int i, VL_INT& _rm_loc) {
      _rm_loc.min(...);
      // _rm_loc = RAJA_MIN(..., _rm_loc);  // ---> This essentially does the same as _rm_loc.min(...)
    }
    );
    
    std::cout << rm_loc.getVal() ...
    std::cout << rm_loc.getLoc() ...
    

    KernelName

    RAJA::expt::KernelName allows CUDA and HIP RAJA::forall calls to be wrapped with an nvtx/rocm range; this is similar to our RAJA::Launch capabilities.

    RAJA::forall<EXEC_POL> ( Res, Seg, 
    RAJA::expt::KernelName("MyFirstRAJAKernel"),
    [=] (int i) {
      ...
    }
    );
    

    Note: Not all ForallParam objects can or need to pass arguments to the lambda.

Lambda Arguments

This interface takes advantage of C++ parameter packs to allow users to define any number of ForallParams in their RAJA::forall calls. It also allows the use of a variety of different ForallParam types.

using VL_INT = RAJA::expt::ValLoc<int>;
VL_INT rm_loc;
double rs;
double rm;
      
RAJA::forall<EXEC_POL> ( Res, Seg, 
  RAJA::expt::Reduce<RAJA::operators::plus>(&rs),        // --> 1 double added
  RAJA::expt::Reduce<RAJA::operators::minimum>(&rm),     // --> 1 double added
  RAJA::expt::Reduce<RAJA::operators::minimum>(&rm_loc), // --> 1 VL_INT added
  RAJA::expt::KernelName("MyFirstRAJAKernel"),           // --> NO args added
  [=] (int i, double& _rs, double& _rm, VL_INT& _rm_loc) {
    _rs += ...
    _rm = RAJA_MIN(..., _rm); 
    _rm_loc.min(...);
  }
);

std::cout << rs ...
std::cout << rm ...
std::cout << rm_loc.getVal() ...
std::cout << rm_loc.getLoc() ...

The lambda arguments are passed in the same order as the corresponding ForallParams. Both the types and the number of arguments must be correct in order to compile successfully; otherwise a static assertion is triggered (an example of a mismatch is shown below the error description):

LAMBDA Not invocable w/ EXPECTED_ARGS.
  • Inspecting the LAMBDA and EXPECTED_ARGS types in the error will show you the issue.
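
For instance, a call like the following (a sketch reusing the placeholder names from the examples above) fails to compile because the lambda does not accept the double& argument added by the Reduce parameter:

double rs;

RAJA::forall<EXEC_POL> ( Res, Seg, 
  RAJA::expt::Reduce<RAJA::operators::plus>(&rs), // --> 1 double added
  [=] (int i) {                                   // --> ERROR: missing the double& reducer argument
    ...
  }
);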

Implementation

  • The top layer interface of RAJA takes all ForallParams and the lambda as a single parameter pack.
  • It extracts the lambda from the pack and creates a RAJA::expt::ForallParamPack object to pass around the ForallParams.
  • If the ForallParamPack is empty the current RAJA::forall implementation is called. If not, the expt implementation is used.
// Old pattern/forall.hpp
forall(ExecPol&& p, Res r, Container c, LoopBody&& loop_body) {
  ...
  wrap::forall(r, p, c, body);
  ...
}
// New pattern/forall.hpp
forall(ExecPol&& p, Res r, Container c, Params... params) {
  auto f_params = RAJA::expt::make_forall_param_pack(params...);
  auto loop_body = RAJA::expt::get_lambda(params...);
  ...
  wrap::forall(r, p, c, body, f_params); // --> Underlying calls use SFINAE to check if f_params is empty.
  ...
}
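
As an illustration of the lambda-extraction step, RAJA::expt::get_lambda can be thought of as returning the trailing element of the parameter pack. A minimal, hypothetical sketch (not the actual RAJA implementation):

#include <tuple>
#include <utility>

// Hypothetical sketch: return the last argument of the pack, i.e. the loop body.
// Assumes at least the loop body is present in the pack.
template <typename... Params>
auto get_lambda_sketch(Params&&... params) {
  return std::get<sizeof...(Params) - 1>(
      std::forward_as_tuple(std::forward<Params>(params)...));
}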

ForallParamPack

The RAJA::expt::ForallParamPack (FPP) type is an internal tuple-like object used to pass around the set of ForallParams given to RAJA. It is found at include/RAJA/pattern/params/forall.hpp. A simplified sketch follows the list below.

  • It is responsible for generating the list of lambda arguments.
  • It defines friend functions used by RAJA::expt::ParamMultiplexer.
  • It provides a set of generator functions: an empty FPP, an FPP from a tuple, and an FPP from a template argument pack.
  • It provides type traits for SFINAE on the RAJA::forall_impl methods.
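
The simplified sketch referenced above (hypothetical only: the member and helper names, such as get_lambda_arg_tup(), are illustrative, and the real type uses camp utilities rather than std::tuple):

#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical, simplified model of a tuple-like ForallParamPack.
template <typename... Params>
struct ForallParamPackSketch {
  std::tuple<Params...> params;   // the ForallParam objects, in call order

  // Build the lambda argument list by concatenating each param's contribution
  // (some params, e.g. KernelName, contribute nothing).
  template <std::size_t... Is>
  auto lambda_args(std::index_sequence<Is...>) {
    return std::tuple_cat(std::get<Is>(params).get_lambda_arg_tup()...);
  }
};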

ParamMultiplexer

RAJA::expt::ParamMultiplexer (PMP) is used to execute the required work for each ForallParam at specific locations directly in the new RAJA::forall_impl methods. When a PMP method is called, it in turn calls the corresponding function for each ForallParam object in an FPP. The calls to these methods are generated at compile time, in a fashion similar to C++17 fold expressions (a sketch of this expansion follows the example below).

// Possible implementation of RAJA::forall_impl w/ OpenMP
forall_impl( EXEC_POL, ITER iter, FUNC body, FPP f_params) {
  ...
  RAJA::expt::ParamMultiplexer::init<EXEC_POL>(f_params); // --> Calls init of each ForallParam 
  
  #pragma omp declare reduction( \
    combine \
    : decltype(f_params) \
    : RAJA::expt::ParamMultiplexer::combine<EXEC_POL>(omp_out, omp_in) ) // --> Calls combine of each ForallParam

  #pragma omp parallel for reduction(combine: f_params)
  for (...) {
    RAJA::expt::invoke_body(f_params, body, iter[i]);
  }
  
  RAJA::expt::ParamMultiplexer::resolve<EXEC_POL>(f_params); // --> Calls resolve() of each ForallParam
}
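
The expansion the PMP performs could be pictured as follows (a hypothetical sketch only; RAJA targets C++14 and expands over camp index sequences, and param_init here stands in for the per-ForallParam init() specializations described in the next section):

#include <tuple>

// Illustrative stand-in for the per-ForallParam init() specializations.
template <typename EXEC_POL, typename Param>
void param_init(Param& p);

struct ParamMultiplexerSketch {
  // Call init() on every ForallParam held in a tuple-like pack; the calls are
  // expanded at compile time (shown with a C++17 fold expression for brevity).
  template <typename EXEC_POL, typename... Params>
  static void init(std::tuple<Params...>& fpp) {
    std::apply([](auto&... p) { (param_init<EXEC_POL>(p), ...); }, fpp);
  }
};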

ForallParam policy specialization

The PMP dispatches calls to the list of ForallParam objects in the FPP. Therefore, each ForallParam type must define the operation to perform for each of those methods.

  • Currently init, combine and resolve are required to be defined by each ForallParam type.
    • Let's call these "FP Methods"...
  • The execution policy of the RAJA::forall is passed through the FP Methods allowing us to specialize the work for each RAJA execution policy / backend.
    • This specialization allows this new interface to be easily portable and maintainable across different platforms.
// Specialized OpenMP **Reduce** Implementation
init(Reducer<OP, T>& red){
  red.val = OP::identity();
}
...
combine(Reducer<OP, T>& out, const Reducer<OP, T>& in) {
  out.val = OP{}(out.val, in.val);
}
...
resolve(Reducer<OP, T>& red) {
  *red.target = OP{}(red.val, *red.target);
}

The FP Method parameters can also be overloaded for each backend. CUDA requires some device info for init and only requires one reducer argument for combine...

// Specialized CUDA **Reduce** Implementation
init(Reducer<OP, T>& red, cudaInfo& cs){ 
  // Initialize memory on the device for the reduction
}
...
combine(Reducer<OP, T>& red) {
  RAJA::cuda::impl::expt::grid_reduce(red);
}
...
resolve(Reducer<OP, T>& red) {
  // Copy result back to host
  *red.target = OP{}(red.val, *red.target);
}

If certain FP Methods are not needed, they can simply be no-ops. RAJA::expt::KernelName does not need to execute anything for combine.

// Specialization for CUDA **KernelName**
init(KernelName& kn){
  nvtxRangePush(kn.name);
}
...
combine(KernelName&) {
  // No-Op.
}
...
resolve(KernelName&) {
  nvtxRangePop();
}

New Reducers Application Integration

To use the new Reducer interface, applications can port their code over piece by piece; both reduction interfaces will work correctly together in a single RAJA::forall call. However, performance gains for OpenMP will likely not be achievable until a RAJA kernel is entirely ported to the new interface.
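
For example (a sketch assuming an existing OpenMP reduction policy that matches EXEC_POL), a current RAJA::ReduceSum object can be used alongside a new RAJA::expt::Reduce parameter in the same call:

double rs_new;
RAJA::ReduceSum<RAJA::omp_reduce, double> rs_old(0.0);   // existing reducer interface

RAJA::forall<EXEC_POL> ( Res, Seg, 
  RAJA::expt::Reduce<RAJA::operators::plus>(&rs_new),    // new interface --> 1 double added
  [=] (int i, double& _rs_new) {
    rs_old += ...         // old reducer: captured by the lambda
    _rs_new += ...        // new reducer: passed as a lambda argument
  }
);

std::cout << rs_old.get() ...
std::cout << rs_new ...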

Testing and Examples/Exercises

This PR provides duplicates of all the tests we currently perform for reductions in RAJA::forall.

Prototype and Future Work

The ForallParam RAJA::forall interface is quite extensible; additional operations and features can be added relatively easily in the future. We have a working (OpenMP) prototype for ReducerArray functionality.

double rs;
double array[arr_sz];
      
RAJA::forall<EXEC_POL> ( Res, Seg, 
  RAJA::expt::Reduce<RAJA::operators::plus>(&rs),
  RAJA::expt::ReduceArray<RAJA::operators::plus>(array, arr_sz),
  [=] (int i, double& _rs, double* _arr) {
    _rs += ...
    _arr[1] += ...
    _arr[3] += ...
  }
);

std::cout << rs ...
std::cout << array[3] ...

Other Changes

  • Lassen GitLab CI has bumped the nvcc version to 11.1.0 @adrienbernede
  • The invisible alias for the expt namespace has been removed because some sections of this PR declare the namespace RAJA::expt::detail. This confuses MSVC and breaks our Windows builds. MSVC seems to be unable to find symbols that live in RAJA::detail or in any other code where we have both RAJA::expt::<Name> and RAJA::<Name>.

mdavis36 avatar Mar 29 '21 20:03 mdavis36

Hi @mdavis36, I'm getting the same error when building CUDA and HIP (after applying the HIP changes), maybe we're missing a commit?

/g/g20/chen59/work/RAJAnewreduction/rajamikereduction2/new_reduce/forall_param.hpp(118): error #304: no instance of overloaded function "detail::ForallParamPack<Params...>::lambda_args [with Params=<detail::Reducer<RAJA::operators::plus<double, double, double>, double>, detail::Reducer<RAJA::operators::minimum<double, double, double>, double>, detail::Reducer<RAJA::operators::maximum<double, double, double>, double>>]" matches the argument list

        argument types are: (camp::make_idx_seq_t<3L>)

      detected during:

        instantiation of class "detail::ForallParamPack<Params...> [with Params=<detail::Reducer<RAJA::operators::plus<double, double, double>, double>, detail::Reducer<RAJA::operators::minimum<double, double, double>, double>, detail::Reducer<RAJA::operators::maximum<double, double, double>, double>>]" 

/g/g20/chen59/work/RAJAnewreduction/rajamikereduction2/new_reduce/openmp/forall.hpp(11): here

        instantiation of "std::enable_if_t<std::is_same<EXEC_POL, RAJA::policy::omp::omp_parallel_for_exec>::value, void> detail::forall_param(EXEC_POL &&, int, const B &, Params...) [with EXEC_POL=RAJA::policy::omp::omp_parallel_for_exec, B=lambda [](int, double &, double &, double &)->void, Params=<detail::Reducer<RAJA::operators::plus<double, double, double>, double>, detail::Reducer<RAJA::operators::minimum<double, double, double>, double>, detail::Reducer<RAJA::operators::maximum<double, double, double>, double>>]" 

(153): here

        instantiation of "void forall_param<ExecPol,B,Params...>(int, const B &, Params...) [with ExecPol=RAJA::policy::omp::omp_parallel_for_exec, B=lambda [](int, double &, double &, double &)->void, Params=<detail::Reducer<RAJA::operators::plus<double, double, double>, double>, detail::Reducer<RAJA::operators::minimum<double, double, double>, double>, detail::Reducer<RAJA::operators::maximum<double, double, double>, double>>]" 

/g/g20/chen59/work/RAJAnewreduction/rajamikereduction2/new_reduce.cpp(85): here

rchen20 avatar Aug 30 '21 23:08 rchen20

@adrienbernede @trws @rhornung67 After merging in develop w/ the new Gitlab-CI it looks like all of the tests here are failing due to an ambiguous call to get<> from camp. Is this possibly because the GitLab CI is using the spack-built camp, which is different from our current submodule? I'm fairly certain it was using the spack-built camp before. Any ideas what might be going on here?

mdavis36 avatar Oct 06 '22 21:10 mdavis36

@mdavis36 camp main was updated today and this is what we have in the RAJA spack package for the CI: https://github.com/LLNL/RAJA/blob/develop/scripts/spack_packages/raja/package.py#L98

Could that be the culprit? The only thing I can suggest without digging further is to try building with camp before the update and see if that fixes it. @MrBurmark could the earlier merge of your camp PR https://github.com/LLNL/camp/pull/117/files cause this?

rhornung67 avatar Oct 06 '22 21:10 rhornung67

@rhornung67 Yep, that's the issue. I put up a PR #1342 with the fix.

mdavis36 avatar Oct 06 '22 22:10 mdavis36

@adrienbernede the xl_16_1_1_12_gcc_8_3_1_cuda_11_0_2 build from the radiuss-shared-ci is failing because of the cuda version here. Can we remove that spec from the shared ci, or is there a good way to disable it locally?

mdavis36 avatar Oct 10 '22 23:10 mdavis36

@adrienbernede the xl_16_1_1_12_gcc_8_3_1_cuda_11_0_2 build from the radiuss-shared-ci is failing because of the cuda version here. Can we remove that spec from the shared ci, or is there a good way to disable it locally?

I wanted to confirm the failure with all the flags and the cuda arch set like in other cuda specs. The spec for this test was:

+openmp  +cuda %[email protected] cxxflags="--gcc-toolchain=/usr/tce/packages/gcc/gcc-8.3.1 -qthreaded -std=c++14 -O3 -qstrict -qxlcompatmacros -qlanglvl=extended0x -qalias=noansi -qhot -qpic -qsmp=omp -qsuppress=1500-029 -qsuppress=1500-036" cflags="--gcc-toolchain=/usr/tce/packages/gcc/gcc-8.3.1" cuda_arch=70 ^[email protected] ^[email protected] 

Hidden in the middle of the logs (due to the triggering of a verbose build after the failed one), I found the error message (I initially thought the target was simply timing out...):

[  3%] Building CUDA object exercises/CMakeFiles/scan.dir/scan.cpp.o
/g/g91/bernede1/.jacamar-ci/builds/zkQzXxmM/000/gitlab/bernede1/raja_nomirror/include/RAJA/index/IndexSet.hpp(354): warning #3056-D: calling a __host__ function("RAJA::detail::ReduceOMP<double,  ::RAJA::reduce::sum<double> > ::~ReduceOMP") from a __host__ __device__ function("RAJA::ReduceSum< ::RAJA::policy::omp::omp_reduce, double> ::~ReduceSum") is not allowed
/g/g91/bernede1/.jacamar-ci/builds/zkQzXxmM/000/gitlab/bernede1/raja_nomirror/include/RAJA/index/IndexSet.hpp(354): warning #3056-D: calling a __host__ function("RAJA::detail::ReduceOMP<double,  ::RAJA::reduce::sum<double> > ::~ReduceOMP") from a __host__ __device__ function("RAJA::ReduceSum< ::RAJA::policy::omp::omp_reduce, double> ::~ReduceSum [subobject]") is not allowed
ptxas fatal   : Unresolved extern function '_ZN4camp14tuple_cat_pairINS_5tupleIJPN4RAJA4expt6ValLocIiEEEEES7_JLl0EEJLl0EEEEDaOT_NS_7int_seqIlJXT1_EEEEOT0_NSA_IlJXT2_EEEE'
gmake[2]: *** [examples/CMakeFiles/forall-param-reductions.dir/build.make:76: examples/CMakeFiles/forall-param-reductions.dir/forall-param-reductions.cpp.o] Error 255
gmake[1]: *** [CMakeFiles/Makefile2:17031: examples/CMakeFiles/forall-param-reductions.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....

So, OK, I’ll push a deactivation of that job. I don’t think we need to run it anymore (even with allowed failure).

adrienbernede avatar Oct 11 '22 12:10 adrienbernede

@mdavis36 is this ready to merge when we get a few approvals?

rhornung67 avatar Oct 12 '22 15:10 rhornung67

@rhornung67 Almost, I realized I didn't port omp-target over from the prototype code. Let me try to get that in today and then I think we'll be ready to go.

mdavis36 avatar Oct 12 '22 16:10 mdavis36

Sounds good. Thanks @mdavis36. It will be interesting to see how much faster this is than what we currently have for OpenMP target. 😄

rhornung67 avatar Oct 12 '22 16:10 rhornung67

@trws @MrBurmark @artv3 @rhornung67 I think this is ready for another pass when you guys get a chance.

mdavis36 avatar Oct 13 '22 01:10 mdavis36

@mdavis36 I activated this branch on RTD so we can review the generated sphinx docs: https://readthedocs.org/projects/raja/versions/

rhornung67 avatar Oct 14 '22 19:10 rhornung67

@mdavis36 I pushed a commit with some changes to your docs: clarify some points, fix some formatting and typos.

rhornung67 avatar Oct 14 '22 23:10 rhornung67

Will wait for #1273 to merge before this as the changes to the expt namespace will create conflicts.

mdavis36 avatar Oct 18 '22 16:10 mdavis36

Will wait for #1273 to merge before this as the changes to the expt namespace will create conflicts.

okay, we can merge #1273 first.

artv3 avatar Oct 18 '22 16:10 artv3

This PR introduces a lot of doc trees. @mdavis36, could you have pushed a build of the sphinx docs by mistake?

artv3 avatar Oct 19 '22 15:10 artv3

This PR introduces a lot of doc trees. @mdavis36, could you have pushed a build of the sphinx docs by mistake?

@artv3 good catch. @mdavis36 please remove the extraneous doc files that should not be in the repo. It looks like the entire directory docs/_build should be deleted.

When I'm working on docs for a PR, I usually go into RTD and activate the branch so I can see how the docs will look when generated in the same environment folks will use to view them. They are regenerated each time you push a change to the activated branch. That way you don't have to build the Sphinx docs locally, and you avoid accidentally committing extraneous doc files.

rhornung67 avatar Oct 19 '22 15:10 rhornung67

This is green now and I believe all comments have been addressed. Anyone want to take a final look before I merge this?

mdavis36 avatar Oct 19 '22 19:10 mdavis36

@mdavis36 looks good to go. merge at will!

rhornung67 avatar Oct 19 '22 19:10 rhornung67