
ParallelFor with compile time optimization of kernels with run time parameters

Open WeiqunZhang opened this issue 3 years ago • 11 comments

Branches inside ParallelFor can be very expensive. If a branch uses a lot of resources (e.g., registers), it can significantly hurt performance by lowering GPU occupancy, even if the branch is never executed at run time. For CPUs, it can affect vectorization of the kernel.

The new ParallelFor functions use C++17 fold expressions to generate kernel launches for all run time variants. Only one of them will be executed; which one is chosen at run time depends on the run time parameters. The kernel function can use constexpr if to discard unused code blocks for better run time performance. Here are two examples of how to use them.

int runtime_option = ...;
enum All_options : int { A0, A1, A2, A3};
// Four ParallelFors will be generated.
ParallelFor(TypeList<CompileTimeOptions<A0,A1,A2,A3>>{}, {runtime_option},
            box, [=] AMREX_GPU_DEVICE (int i, int j, int k, auto control)
{
    ...
    if constexpr (control.value == A0) {
        ...
    } else if constexpr (control.value == A1) {
        ...
    } else if constexpr (control.value == A2) {
        ...
    } else {
        ...
    }
    ...
});

and

int A_runtime_option = ...;
int B_runtime_option = ...;
enum A_options : int { A0, A1, A2, A3};
enum B_options : int { B0, B1 };
// 4*2=8 ParallelFors will be generated.
ParallelFor(TypeList<CompileTimeOptions<A0,A1,A2,A3>,
                     CompileTimeOptions<B0,B1> > {},
            {A_runtime_option, B_runtime_option},
            N, [=] AMREX_GPU_DEVICE (int i, auto A_control, auto B_control)
{
    ...
    if constexpr (A_control.value == A0) {
        ...
    } else if constexpr (A_control.value == A1) {
        ...
    } else if constexpr (A_control.value == A2) {
        ...
    } else {
        ...
    }
    if constexpr (A_control.value != A3 && B_control.value == B1) {
        ...
    }
    ...
});

Note that, due to a limitation of CUDA's extended device lambdas, a variable cannot be captured for the first time inside a constexpr if block. If nvcc complains about this, you will have to reference the variable outside the constexpr if so that it gets captured, as sketched below. The data type of the compile-time parameters is int.
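
For example, a minimal sketch of the workaround (fac is only an illustrative name):

amrex::Real fac = ...;
ParallelFor(TypeList<CompileTimeOptions<A0,A1>>{}, {runtime_option},
            box, [=] AMREX_GPU_DEVICE (int i, int j, int k, auto control)
{
    auto fac_ = fac; // touch fac outside the constexpr if so that nvcc captures it
    if constexpr (control.value == A1) {
        ...          // use fac_ here; fac must not appear for the first time in this block
    }
});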

Thanks to Maikel Nadolski and Alex Sinn for showing us the meta-programming techniques used here.

Checklist

The proposed changes:

  • [ ] fix a bug or incorrect behavior in AMReX
  • [x] add new capabilities to AMReX
  • [ ] changes answers in the test suite to more than roundoff level
  • [ ] are likely to significantly affect the results of downstream AMReX users
  • [ ] include documentation in the code and/or rst files, if appropriate

WeiqunZhang avatar Sep 17 '22 05:09 WeiqunZhang

IMHO, what you do here is introduce runtime polymorphism based on (multi-dimensional) indices. As in: for each runtime md index I, a different function f<I> is called. The present solution expands a linear series of if conditions (without breaking on a match), which in theory is optimisable for many options. There are several ways to provide this polymorphic behaviour based on an index, and all have some up- and downsides. This touches the old question of how to dispatch efficiently. Either

  1. if-clause based (this solution)
  2. switch based (could be handcrafted, or {mpark,std}::variant based (depends on compiler vendor))
  3. tensor of function pointers (std::function possibly allocates memory, function_refs can dangle)
  4. solutions involving vtables

I don't mean to say the present solution is per se bad. But if the linear sequence of if-clauses ever turns out to be a problem, one option would be to try a different dispatch method (see the sketch below).
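
For illustration, a minimal sketch of option 3 for a single option dimension (kernel_variant and dispatch_table are made-up names, not AMReX or proposed API):

#include <array>
#include <utility>

template <int I>
void kernel_variant () { /* body instantiated for compile-time option I */ }

template <int... Is>
void dispatch_table (int runtime_option, std::integer_sequence<int, Is...>)
{
    // one table entry per instantiated variant; the runtime index selects
    // an entry directly instead of walking a chain of if-clauses
    static constexpr std::array<void (*)(), sizeof...(Is)> table{ &kernel_variant<Is>... };
    table[runtime_option]();
}

// usage: dispatch_table(runtime_option, std::make_integer_sequence<int,4>{});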

maikel avatar Sep 17 '22 09:09 maikel

Currently it should break after a match because of boolean short-circuiting.
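
A stripped-down sketch of that behaviour for a single option dimension (run_matching and launch are made-up names, not the AMReX code):

#include <cstdio>
#include <utility>

template <int I>
void launch () { std::printf("variant %d\n", I); }

template <int... Is>
void run_matching (int option, std::integer_sequence<int, Is...>)
{
    // each term launches only on a match and then yields true, so the
    // fold over || stops evaluating the remaining options after that match
    bool matched = ((Is == option && (launch<Is>(), true)) || ...);
    (void) matched;
}

// run_matching(2, std::make_integer_sequence<int,4>{}); prints "variant 2" only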

AlexanderSinn avatar Sep 17 '22 10:09 AlexanderSinn

Indeed, you are right!

maikel avatar Sep 17 '22 11:09 maikel

As much as I like fold expressions, I made a second implementation using tuples and recursion. Notably, this solves the capture problem, as expensive functionals can now be used as control values and are eliminated if a NoOp is chosen instead. Additionally, the CPU-side search for the correct option is now done for each option dimension individually rather than over the entire flattened multi-dimensional option space. This may take longer to compile, however.

NVCC: https://cuda.godbolt.org/z/ez3h61b7M GCC: https://cuda.godbolt.org/z/EvrceYocd
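
A much-simplified sketch of the general idea (not the code in the links; select_options is a made-up name): peel off one option dimension per recursion level, turn the matching value into a compile-time constant, and pass it along.

#include <tuple>
#include <type_traits>
#include <utility>

// base case: every dimension has been resolved to a compile-time constant
template <class F, class... Chosen>
void select_options (F const& f, std::tuple<>, int const*, Chosen... chosen)
{
    f(chosen...);
}

// recursive case: search only the current dimension, then recurse on the rest
template <class F, int... Vals, class... Rest, class... Chosen>
void select_options (F const& f,
                     std::tuple<std::integer_sequence<int, Vals...>, Rest...>,
                     int const* opts, Chosen... chosen)
{
    (void) ((Vals == *opts
             && (select_options(f, std::tuple<Rest...>{}, opts+1, chosen...,
                                std::integral_constant<int, Vals>{}), true)) || ...);
}

// usage sketch:
//   int opts[] = {A_runtime_option, B_runtime_option};
//   select_options([](auto A_c, auto B_c) { /* A_c.value, B_c.value are constexpr */ },
//                  std::tuple<std::integer_sequence<int,0,1,2,3>,
//                             std::integer_sequence<int,0,1>>{}, opts);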

AlexanderSinn avatar Sep 21 '22 16:09 AlexanderSinn

Cool! I have not looked at it carefully yet. But one issue might be the use of std::tuple, which is not trivially copyable. That will be an issue for SYCL. So we probably will have to use our own tuple like amrex::GpuTuple.
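
A quick way to probe this concern on a given standard library (just an illustrative check, not part of any patch):

#include <cstdio>
#include <tuple>
#include <type_traits>

int main ()
{
    // std::tuple is not guaranteed to be trivially copyable, which is the
    // SYCL concern noted above; depending on the library this may print 0
    std::printf("trivially copyable: %d\n",
                (int) std::is_trivially_copyable_v<std::tuple<int,double>>);
}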

WeiqunZhang avatar Sep 21 '22 16:09 WeiqunZhang

That should be ok as long as there are make_tuple() and tuple_cat() equivalents

AlexanderSinn avatar Sep 21 '22 16:09 AlexanderSinn

Yes, there are amrex::makeTuple and amrex::TupleCat.

WeiqunZhang avatar Sep 21 '22 16:09 WeiqunZhang

There is also an implementation that uses std::variant to lift runtime indices to compile time without recursion. But I don't quite understand the problem with lambdas, and therefore I don't know if it reintroduces the problem?

maikel avatar Sep 21 '22 17:09 maikel

It does indeed work, but it is a bit tedious to generate the variants. NVCC: https://cuda.godbolt.org/z/h14zEz4zx GCC: https://cuda.godbolt.org/z/a64aKvdh8
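
A rough sketch of that direction (not the code in the links; make_option_variant is a made-up name): map the runtime index to a std::variant of integral_constant alternatives, then std::visit hands the kernel a compile-time value without recursion.

#include <type_traits>
#include <utility>
#include <variant>

template <int... Is>
auto make_option_variant (int option, std::integer_sequence<int, Is...>)
{
    using V = std::variant<std::integral_constant<int, Is>...>;
    V v{};
    // select the alternative whose value matches the runtime option
    (void) ((Is == option && (v = std::integral_constant<int, Is>{}, true)) || ...);
    return v;
}

// usage sketch:
//   auto v = make_option_variant(runtime_option, std::make_integer_sequence<int,4>{});
//   std::visit([&](auto control) { /* control.value is a compile-time constant */ }, v);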

AlexanderSinn avatar Sep 21 '22 20:09 AlexanderSinn

Hm yes, that is the direction I thought of, and the Variant factory can be improved. Let's see if it brings anything to the table.

Edit: I mean I will try to improve it later.

maikel avatar Sep 22 '22 03:09 maikel

I personally like the current fold expression approach. I think it's easier for the users.

WeiqunZhang avatar Sep 25 '22 17:09 WeiqunZhang