[Codegen][GPU] Add range information to GPU dispatch IDs
NOTE: Don't land this until upstream PR https://github.com/llvm/llvm-project/pull/95166 lands so I can use gpu.thread_id x upper_bound X.
First, this patch implements InferIntRangeInterface for hal.interface.workgroup.{size,id,count} using a local upper_bound attribute.
Then, it adds a -iree-codegen-gpu-propagate-dispatch-size-bounds pass that attaches these upper_bound attributes to the interface.workgroup operations and to gpu.thread_id, based on static information available late in the codegen pipeline.
Then, it uses -int-range-optimizations and -arith-unsigned-when-equivalent to optimize indexing after -lower-affine, getting rid of a bunch of "if the input's negative" logic that isn't actually needed in many of our kernels.
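As a sketch of what the pass produces (the op names are real; the exact upper_bound syntax is assumed from the upstream PR and this patch, and the bounds 1024 and 64 are hypothetical example values):

```mlir
// Before the pass: nothing bounds the IDs, so later lowerings must assume
// the full index range and emit signed-division edge-case logic.
%wg_id_x = hal.interface.workgroup.id[0] : index
%tid_x = gpu.thread_id x

// After -iree-codegen-gpu-propagate-dispatch-size-bounds, with a static
// workgroup size of 64 along x and (hypothetically) at most 1024 workgroups:
%wg_id_x2 = hal.interface.workgroup.id[0] upper_bound 1024 : index
%tid_x2 = gpu.thread_id x upper_bound 64
```

With these bounds visible, -int-range-optimizations can prove the values non-negative and -arith-unsigned-when-equivalent can rewrite the signed index arithmetic that -lower-affine emits.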
It also ensures that these upper_bound values propagate to LLVM (at least on ROCDL; NVVM doesn't have this hooked up to the intrinsics).
Finally, this commit updates the ROCDL target to use upstream's lowerings - and fixes the parsing of the architecture name.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
(is there an AMD list for the CLA or should I do that on an individual basis?)
Select 'For myself' or something along these lines
FYI there are some tips about setting up DCO here: https://iree.dev/developers/general/contributing/#developer-certificate-of-origin
(either by using -s on your commits or by setting up your SSH key to sign automatically)
What's a good way to stack PRs? This depends on #17771
- Create a branch in this repo instead of a fork (named users/[username]/[branch name]; see https://iree.dev/developers/general/contributing/#branch-naming) and point your dependent PRs at the branch on the first PR
- Create separate commits for each PR and direct reviewers to only look at the later commits (this is harder to keep sane with rounds of review comments and rebasing, though git absorb can help a bit)
@ScottTodd Sadly, I don't have write on the repository (yet?) so I can't make a stacking branch
One of the reasons I'd like this is that it translates to a range() attribute down in LLVM, whereas I figure these util.assume ops don't; or, if they do, they translate to llvm.assume, which is much harder for the backend to reason about.
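To illustrate the difference (a hand-written sketch, not output from this patch; the bound of 64 is an assumed workgroup size):

```llvm
; With a range attribute, the backend sees the bound directly on the value:
%tid = call range(i32 0, 64) i32 @llvm.amdgcn.workitem.id.x()

; An assume encodes the same fact, but as extra IR the backend must
; re-derive the range from:
%tid2 = call i32 @llvm.amdgcn.workitem.id.x()
%cmp = icmp ult i32 %tid2, 64
call void @llvm.assume(i1 %cmp)
```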
For status notes: I'll get back to this after I've landed some planned changes upstream that'll let the affine composition logic pick up bounds we impose on gpu.thread_id, since there's a bunch of trivial divisions by 64, subtractions, and so on that should be simplified out
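As a concrete instance of the kind of simplification meant here (hypothetical values, assuming a workgroup size of 64 along x), once the bound tid < 64 is visible to affine composition:

```mlir
// With %tid known to lie in [0, 64):
%tid = gpu.thread_id x upper_bound 64
// tid floordiv 64 == 0 and tid mod 64 == tid, so both applies can fold away.
%zero = affine.apply affine_map<(d0) -> (d0 floordiv 64)>(%tid)
%same = affine.apply affine_map<(d0) -> (d0 mod 64)>(%tid)
```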
I believe this PR caused a regression with Llama3.1 in Shortfin. The error is also reproducible in iree-test-suites: