[Codegen][GPU] Add range information to GPU dispatch IDs
NOTE: Don't land this until upstream PR https://github.com/llvm/llvm-project/pull/95166 lands so I can use gpu.thread_id x upper_bound X.
First, this patch implements InferIntRangeInterface for hal.interface.workgroup.{size,id,count} using a local upper_bound attribute.
Then, it adds a -iree-codegen-gpu-propagate-dispatch-size-bounds pass that attaches these upper_bound attributes to the interface.workgroup operations and to gpu.thread_id, based on static information available late in the codegen pipeline.
Then, it uses -int-range-optimizations and -arith-unsigned-when-equivalent to optimize indexing after -lower-affine, getting rid of a bunch of "if the input's negative" logic that isn't actually needed in many of our kernels.
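As a sketch of what the pass produces (the op names are real; the exact upper_bound syntax is assumed from the upstream PR and this patch, and the bounds 1024 and 64 are hypothetical example values):

```mlir
// Before the pass: nothing bounds the IDs, so later lowerings must assume
// the full index range and emit signed-division edge-case logic.
%wg_id_x = hal.interface.workgroup.id[0] : index
%tid_x = gpu.thread_id x

// After -iree-codegen-gpu-propagate-dispatch-size-bounds, with a static
// workgroup size of 64 along x and (hypothetically) at most 1024 workgroups:
%wg_id_x2 = hal.interface.workgroup.id[0] upper_bound 1024 : index
%tid_x2 = gpu.thread_id x upper_bound 64
```

With these bounds visible, -int-range-optimizations can prove the values non-negative and -arith-unsigned-when-equivalent can rewrite the signed index arithmetic that -lower-affine emits.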
It also ensures that these upper_bound values propagate to LLVM (at least on ROCDL; NVVM doesn't have this hooked up to the intrinsics).
Finally, this commit updates the ROCDL target to use upstream's lowerings - and fixes the parsing of the architecture name.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
(is there an AMD list for the CLA or should I do that on an individual basis?)
Select 'For myself' or something along these lines
FYI there are some tips about setting up DCO here: https://iree.dev/developers/general/contributing/#developer-certificate-of-origin
(either by using -s on your commits or by setting up your SSH key to sign automatically)
What's a good way to stack PRs? This depends on #17771
- Create a branch in this repo instead of a fork (named users/[username]/[branch name]; see https://iree.dev/developers/general/contributing/#branch-naming) and point your dependent PRs at the branch on the first PR
- Create separate commits for each PR and direct reviewers to only look at the later commits (this is harder to keep sane with rounds of review comments and rebasing, though git absorb can help a bit)
@ScottTodd Sadly, I don't have write on the repository (yet?) so I can't make a stacking branch
One of the reasons I'd like this is that it translates to a range() attribute down in LLVM, whereas I figure these util.assume ops don't; or, if they do, they translate to llvm.assume, which is much harder for the backend to reason about.
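To illustrate the difference (a hand-written sketch, not output from this patch; the bound of 64 is an assumed workgroup size):

```llvm
; With a range attribute, the backend sees the bound directly on the value:
%tid = call range(i32 0, 64) i32 @llvm.amdgcn.workitem.id.x()

; An assume encodes the same fact, but as extra IR the backend must
; re-derive the range from:
%tid2 = call i32 @llvm.amdgcn.workitem.id.x()
%cmp = icmp ult i32 %tid2, 64
call void @llvm.assume(i1 %cmp)
```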
For status notes: I'll get back to this after I've landed some planned changes upstream that'll let the affine composition logic pick up bounds we impose on gpu.thread_id, since there's a bunch of trivial divisions by 64, subtractions, and so on that should be simplified out
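As a concrete instance of the kind of simplification meant here (hypothetical values, assuming a workgroup size of 64 along x), once the bound tid < 64 is visible to affine composition:

```mlir
// With %tid known to lie in [0, 64):
%tid = gpu.thread_id x upper_bound 64
// tid floordiv 64 == 0 and tid mod 64 == tid, so both applies can fold away.
%zero = affine.apply affine_map<(d0) -> (d0 floordiv 64)>(%tid)
%same = affine.apply affine_map<(d0) -> (d0 mod 64)>(%tid)
```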
I believe this PR caused a regression with Llama3.1 in Shortfin. The error is also reproducible in iree-test-suites: