warpx icon indicating copy to clipboard operation
warpx copied to clipboard

[WIP] PushPX: GPU kernel optimization

Open WeiqunZhang opened this issue 3 years ago • 11 comments

The GatherAndPush kernel in the PushPX function has a very low occupancy due to register pressure. There are a number of reasons. By default, we compile with QED module on, even if we do not use it at run time. Another culprit is the GetExternalEB functor that contains 7 Parsers. Again, we have to pay a high runtime cost, even if we do not use it. In this PR, we move some runtime logic out of the GPU kernel to eleminate the unnecessary cost if QED and GetExternalEB are not used at run time.

Here are some performance results before this PR.

| QED | GetExternalEB | Time |
|-----+---------------+------|
| On  | On            | 2.17 |
| Off | On            | 1.79 |
| Off | Commented out | 1.34 |

Note that in the tests neither QED nor GetExternalEB is actually used at run time. But the extra cost is very high. With this PR, the kernel time is the same as that when both QED and GetExternalEB are disabled at compile time, even though both options are disabled at run time.

More information on the kernels compiled for MI250X. The most expensive variant with both QED and GetExternalEB on has

NumSgprs: 108
NumVgprs: 256
NumAgprs: 40
TotalNumVgprs: 296
ScratchSize: 264
Occupancy: 1

The cheapest variant with both QED and GetExternalEB disabled has

NumSgprs: 104
NumVgprs: 249
NumAgprs: 0
TotalNumVgprs: 249
ScratchSize: 144
Occupancy: 2

WeiqunZhang avatar Sep 14 '22 22:09 WeiqunZhang

This version works for HIP and nvcc 17. Hopefully it will pass the CIs.

WeiqunZhang avatar Sep 14 '22 22:09 WeiqunZhang

@maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG

WeiqunZhang avatar Sep 16 '22 16:09 WeiqunZhang

Oh awesome, that's the Cartesian product we need :rocket: :sparkles: Shall we wait for this to be finished? We can use this in many places :)

ax3l avatar Sep 16 '22 18:09 ax3l

Yes, we could wait till the functionality is in amrex.

WeiqunZhang avatar Sep 16 '22 18:09 WeiqunZhang

If I can comment here, can this be done with templating instead of using the more obscure std::is_same<decltype..., similar to the templating for doParticlePush? A few comments in the code would be helpful, saying that this is being done to reduce register pressure, avoiding calls to the external fields and QED stuff when not used. And also, the lambda is probably big enough to be a separate routine. Otherwise, this is great with a very nice speed up!

dpgrote avatar Sep 16 '22 21:09 dpgrote

Yes, the lambda is big enough. So we do not want to write more than once. Also it captures so many variables, it will also be error prone if we use a non-lambda function because we might mess up the order of the variables in a function's parameters.

WeiqunZhang avatar Sep 16 '22 21:09 WeiqunZhang

@maikel showed us a very cool trick. https://cuda.godbolt.org/z/edxEMY7YG

Here is a N-dimensional version of that, it even compiles with gcc7.5. I am still unsure about that NVCC/NVHPC redefinition problem, however.

https://cuda.godbolt.org/z/xP9nKMYdM

AlexanderSinn avatar Sep 16 '22 22:09 AlexanderSinn

What's the redefinition problem with nvcc? The compiler explorer link compiles with nvcc.

WeiqunZhang avatar Sep 16 '22 22:09 WeiqunZhang

https://github.com/ECP-WarpX/WarpX/pull/3399#issuecomment-1247354064

AlexanderSinn avatar Sep 16 '22 22:09 AlexanderSinn

Oh that. I have no ideas.

WeiqunZhang avatar Sep 16 '22 22:09 WeiqunZhang

https://github.com/AMReX-Codes/amrex/pull/2954

WeiqunZhang avatar Sep 17 '22 06:09 WeiqunZhang

@WeiqunZhang we merged in the update of https://github.com/AMReX-Codes/amrex/pull/2954 to WarpX now. Feel free to ping me when this PR is rebased and ready to go :rocket:

ax3l avatar Oct 25 '22 22:10 ax3l

Yes. Thanks for reminding me!

WeiqunZhang avatar Oct 25 '22 22:10 WeiqunZhang

@ax3l It's ready for review.

WeiqunZhang avatar Oct 29 '22 19:10 WeiqunZhang

@WeiqunZhang There seems to be a remaining compilation error with clang in the CI ; is that correct?

RemiLehe avatar Nov 17 '22 23:11 RemiLehe

Failed to build pywarpx

I will try to rerun the job. If it still fails, I will merge development into this to see if that will fix it.

WeiqunZhang avatar Nov 17 '22 23:11 WeiqunZhang

Oh, I think I know why. It needs a more recent version of amrex due to clang is not happy with an amrex function. So I just merged development into this branch.

WeiqunZhang avatar Nov 17 '22 23:11 WeiqunZhang

All checks have passed.

WeiqunZhang avatar Nov 18 '22 01:11 WeiqunZhang

ping @lucafedeli88 FYI, as discussed :)

ax3l avatar Nov 18 '22 16:11 ax3l