Add min blocks per SM policies for all relevant GPU policy cases

Open rhornung67 opened this issue 3 years ago • 5 comments

A recent PR added this capability to RAJA::kernel policies. We should double-check that all policy cases are covered. It should also be added to forall policies and to all relevant HIP policy cases.

Here is the PR link: https://github.com/LLNL/RAJA/pull/1039

Here is a link to the associated test issue: https://github.com/LLNL/RAJA/issues/1040

Notes/To-Do:

  • [x] Set default in existing policy to 1 block per SM
  • [ ] Other kernel policies (if applicable)
  • [x] Teams
  • [x] Forall
  • [ ] Repeat for HIP
  • [ ] Repeat for SYCL?
  • [x] cuda_occ requires blocks per SM to be initialized to 0; it then calculates blocks per SM automatically

rhornung67, May 05 '21 20:05
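
As a rough illustration of the last item in the list above, the sketch below shows one way a "0 means calculate it" convention could be resolved. Every name in it (`resolve_blocks_per_sm`, `query_occupancy_blocks_per_sm`) is hypothetical and not part of RAJA. The occupancy query is a stub that returns a fixed placeholder so the example compiles without a GPU toolchain; real CUDA code would presumably call something like `cudaOccupancyMaxActiveBlocksPerMultiprocessor` there.

```cpp
#include <cstddef>

// Hypothetical stand-in for an occupancy query. It returns a fixed
// placeholder value (not a measured one) so the sketch builds on the host.
inline int query_occupancy_blocks_per_sm(std::size_t /*block_size*/)
{
  return 4;
}

// Hypothetical helper: a requested value of 0 means "calculate automatically";
// any other value is taken as the user's explicit min blocks per SM.
inline int resolve_blocks_per_sm(int requested, std::size_t block_size)
{
  return (requested == 0) ? query_occupancy_blocks_per_sm(block_size)
                          : requested;
}
```

With this convention, a policy that defaults to 1 block per SM passes 1 through unchanged, while an occupancy-based policy passes 0 and receives the calculated value.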

@rchen20 Do you have time to tackle this in the near future?

rhornung67, May 05 '21 20:05

> @rchen20 Do you have time to tackle this in the near future?

@rhornung67 Yes, I'll do this. I guess we will need this before the next release?

rchen20, May 06 '21 16:05

> > @rchen20 Do you have time to tackle this in the near future?
>
> @rhornung67 Yes, I'll do this. I guess we will need this before the next release?

Not necessarily. Ben is the only one who asked about it and his use case is covered for now.

rhornung67, May 06 '21 17:05

@rhornung67 @MrBurmark Do we want the user to be able to specify min blocks per SM in forall? If so, this would require adding a template argument to the cuda_exec policy, e.g. cuda_exec<BlockSize, BlocksPerSM, Async>, which would be a breaking change to the API. Or do we simply want to make the existing __launch_bounds__ explicitly use 1 block per SM (which we do by default)?

Alternatively, if we really want the user to be able to specify min blocks per SM in forall, I could put this functionality into the expt::cuda_launch_t<> policy.

rchen20, Nov 18 '21 21:11

@rchen20 This is true, we probably shouldn't break users' code unnecessarily. How about adding a lower-level cuda_exec_explicit policy that takes all three parameters <BlockSize, BlocksPerSM, Async>, and having the existing policies map to the new policy?

MrBurmark, Nov 18 '21 22:11
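
The mapping suggested in the last comment might look roughly like the sketch below. It is only an illustration under stated assumptions: the `sketch` namespace, the member names, and the parameter types are invented here and are not RAJA's actual definitions. The idea is that the explicit policy carries all three parameters, while the existing two-parameter spelling becomes an alias that fills in 1 block per SM, so existing user code keeps compiling.

```cpp
#include <cstddef>
#include <type_traits>

namespace sketch {

// Hypothetical lower-level policy that takes all three parameters.
template <std::size_t BlockSize, std::size_t BlocksPerSM, bool Async = false>
struct cuda_exec_explicit {
  static constexpr std::size_t block_size = BlockSize;
  static constexpr std::size_t blocks_per_sm = BlocksPerSM;
  static constexpr bool async = Async;
};

// The existing two-parameter form is kept as an alias that maps onto the
// explicit policy with the default of 1 block per SM.
template <std::size_t BlockSize, bool Async = false>
using cuda_exec = cuda_exec_explicit<BlockSize, 1, Async>;

}  // namespace sketch

// Both spellings name the same type, so the alias adds no new policy.
static_assert(
    std::is_same<sketch::cuda_exec<256>,
                 sketch::cuda_exec_explicit<256, 1, false>>::value,
    "cuda_exec should map to cuda_exec_explicit with 1 block per SM");
```

A user who needs a different value would write `sketch::cuda_exec_explicit<256, 2, true>` directly, while everyone else continues to use the shorter form.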