RAJA
Add min blocks per SM policies for all relevant GPU policy cases
A recent PR added this capability to RAJA::kernel policies. We should double-check that all policy cases are covered. It should also be added to forall policies and to all relevant HIP policy cases.
Here is the PR link: https://github.com/LLNL/RAJA/pull/1039
Here is a link to the associated test issue: https://github.com/LLNL/RAJA/issues/1040
Notes/To-Do:
- [x] Set default in existing policy to 1 block per SM
- [ ] Other kernel policies (if applicable)
- [x] Teams
- [x] Forall
- [ ] Repeat for HIP
- [ ] Repeat for SYCL?
- [x] cuda_occ requires blocks per SM to be initialized to 0; it then calculates blocks per SM automatically
@rchen20 Do you have time to tackle this in the near future?
@rhornung67 Yes, I'll do this. I guess we will need this before the next release?
Not necessarily. Ben is the only one who asked about it and his use case is covered for now.
@rhornung67 @MrBurmark Do we want the user to be able to specify min blocks per SM in `forall`? If so, this would require 2 template arguments in the cuda_exec policy, e.g. `cuda_exec<BlockSize, BlocksPerSM, Async>`, which would force a hard change to the API. Or did we want to simply clarify the existing `__launch_bounds__` to use 1 block per SM (which we do by default)?
Alternatively, if we really want the user to be able to specify min blocks per SM in `forall`, I could put this functionality into the `expt::cuda_launch_t<>` policy.
@rchen20 This is true, we probably shouldn't break people unnecessarily. How about adding a lower level `cuda_exec_explicit` policy that takes all three `<BlockSize, BlocksPerSM, Async>` and the existing policies can map to the new policy?
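To make the mapping idea concrete, here is a minimal sketch in plain C++, not actual RAJA code. The `sketch` namespace, the member names, and the choice to express the old policy as an alias template are all assumptions for illustration; only the parameter list `<BlockSize, BlocksPerSM, Async>` and the default of 1 block per SM come from this thread.

```cpp
#include <cstddef>
#include <type_traits>

namespace sketch {  // hypothetical namespace, not RAJA's

// Hypothetical lower-level policy that carries all three parameters.
template <std::size_t BlockSize, std::size_t BlocksPerSM, bool Async = false>
struct cuda_exec_explicit {
  static constexpr std::size_t block_size = BlockSize;
  static constexpr std::size_t blocks_per_sm = BlocksPerSM;
  static constexpr bool async = Async;
};

// Existing two-parameter spelling, kept source-compatible by mapping it
// onto the explicit policy with the default of 1 block per SM.
template <std::size_t BlockSize, bool Async = false>
using cuda_exec = cuda_exec_explicit<BlockSize, 1, Async>;

}  // namespace sketch

// Code written against the old spelling resolves to the new policy type.
static_assert(
    std::is_same<sketch::cuda_exec<256>,
                 sketch::cuda_exec_explicit<256, 1, false>>::value,
    "old spelling maps to the explicit policy with 1 block per SM");
```

With an alias like this, a caller who writes `cuda_exec<256>` would not need to change anything, while a caller who wants a different occupancy hint could spell out `cuda_exec_explicit<256, 2>` directly.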