iree
iree copied to clipboard
Make Data-Tiling and Microkernels pluggable
Request description
Starting this thread to discuss changes needed to enable BYO microkernels to the llvm-cpu backend. We plan on keeping XNNPack-inspired microkernels in a separate repo and plug them into the IREE compiler. @MaheshRavishankar @hanhanW @dcaballe @bjacob what are your thoughts on design?
What component(s) does this issue relate to?
Compiler
Additional context
Sample ukernel: https://github.com/openxla/iree/pull/16259
Regarding data-tiling:
At the moment, data-tiling sizes are set in https://github.com/openxla/iree/blob/14927d15c19710bcdd4d630e62b21428424d6ef6/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp . There's nothing hard particularly about making this pluggable, we'd just need to understand precisely the dimensions of pluggability that are desired here (do you want this to be controllable... in the IR? in the iree-compile
command line? Or as new default iree-compile
behavior as in plugging directly into this code? The latter would probably just be a matter of a "compiler plugin").
Regarding ukernels:
- New microkernels (e.g.
iree_uk_somenewop
, separate from the existingiree_uk_mmt4d
) can be developed in a new subdirectory underthird_party
. The build system in that directory can be modelled after the existing ukernel directory, reusing existing shared helpers such asiree_bitcode_library()
. The tests and microbenchmarks can also reuse the existing shared libraries for tests and benchmarks underukernel/tools
. - For extensions of existing microkernels, as done by #16259 adding new element-types and tile-functions to the
iree_uk_mmt4d
ukernel:
- The core logic adding new supported element types (and reference-code tile functions, etc) needs to go in the existing
ukernel/
code, not inthird_party
. Earlier I recommended just moving all of #16259 tothird_party
but I was not thinking about these details. - New architecture-optimized tile functions, such as the one that #16259 adds for arm64, can go either into
third_party
or in the existingukernel/
directory.
I would be OK with the existing ukernel/ directory referencing code in the new third_party/ directory. Though if you want to be completely free to add as much code as you want in the third_party kernels without us worrying about the impact on all users, then you could put that behind a CMake option, off by default.
Another possibility is to allow specifying as a iree-compile
flag the path to an external bitcode file. Earlier I thought that that was already the case but it looks like we dropped that. Could easily be re-added; there's nothing particular about being embedded in iree-compile that makes bitcode special. https://github.com/openxla/iree/blob/14927d15c19710bcdd4d630e62b21428424d6ef6/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L486-L510
...So regarding #16259 specifically, I think that we could choose to actually just merge it as-is, or just leave out the arm64 part (that is the one part that's a candidate for going into a third_party directory).
Could you comment on what op you want the microkernel for? linalg.mmt4d
or some other op?
Could you comment on what op you want the microkernel for?
linalg.mmt4d
or some other op?
It would be for linalg.mmt4d
.
If we had kernels in third_party
or another repo, we would still need to modify files in IREE core to let it know that a certain ukernel exists and propagate/register information about how to use it (lhs/rhs/out data type combination, tile sizes, target, etc.).
Taking this example: https://github.com/openxla/iree/pull/16259, the following files were changed:
In CPUMaterializeEncodingPass.cpp, add:
if (hasUkernel(target) &&
lhs.isSignlessInteger(8) &&
rhs.isSignlessInteger(4) &&
out.isSignlessInteger(32)) {
return {
TileMxNxK{4, 16, 2}, // Aim to use SMLAL.
TileMxNxK{2, 16, 2}, // Truncation of the above.
TileMxNxK{1, 16, 2}, // Truncation of the above.
};
}
In CPULowerToUKernels.cpp, add:
else if (lhsElemType.isSignlessInteger(8) &&
rhsElemType.isSignlessInteger(4) &&
outElemType.isSignlessInteger(32)) {
flags = IREE_UK_FLAG_MMT4D_TYPE_S8S4S32;
}
In exported_bits.h, add:
#define IREE_UK_FLAG_MMT4D_TYPE_S8S4S32 0x0A
There's a difference between a new ukernel (very interesting) and a new arch variant for an existing in-tree ukernel - I thought that this issue would be about the former but it sounds like it's still the latter. I'm left questioning why a new arch-specific implementation of the mmt4d ukernel we have in-tree for several archs/variants already cannot be authored and submitted as a new contribution in-tree like any other contribution. The particular thing being discussed here is 50-100 LoC in a single function and a few extra cases in the compiler. I understand the xnnpack version has good performance and that good performance is desirable, but that doesn't mean it's the only way to do it nor that it has to be copied out. Directly taking snippets from other projects and mashing them into the codebase by sticking them in third_party/ is not how we want to do things, just as we don't want to be adding a third_party/stackoverflow/.
(if you do want to continue down the path of splicing in an arch variant like this, the best way for you to do so is maintain a fork with some patches - I don't suspect we'll be adding a way to do out-of-tree specializations of existing ukernels soon - I still think that's a lot of effort for a single function, though)
I'll be at the mai-tai meeting and can discuss. I've been getting game of telephone descriptions of what this is, and something isn't adding up.
This seems to me like one of those things that if it is a good idea, we want the ukernel upstream and enabled properly for everyone to use. And if not a good idea -- if I were figuring that out, would be done in a fork and experimented on to prove it disprove.
Beyond that, it just seems to be getting stuck on how to write and contribute the code? On that, the project has a pretty clear line that we don't take deps for hundred line internals. If the code is derived from a compatible project, reference/credit it and (re)write it until it passes the discretion of the reviewers. Pretty standard practice.
I think folks may be conflating the role of a fork/branch, out of tree target, and how to credit code. I know that Google has some historic and archaic guidelines on how to organize such things which are derived from how development is done in your monorepo, but that is not here... I think that may be tripping up some folks and causing over indirection. Happy to explain/guide f2f.
@mariecwhite :
- The CPUMaterializeEncodingPass change is exactly what I was referring to above regarding data-tiling. It really is about data-tiling, not ukernel. It's TBD how to make that pluggable from the outside, I don't know yet.
- The CPULowerToUkernels pass: You could clone this pass and have your own outside and let it be a compiler plugin.
@benvanik @stellaraccident : sorry i haven't been sufficiently explicit maybe --- with better understanding of the specifics of #16259, I have come to understand that it does belong in our main repo. It is Marie's own code not XNNPACK code, and it does something reasonable and useful (add a s8s4s32 case to mmt4d, when we already have s8s8s32 and s16u4s32). Further discussion here of letting ukernels go to some third_party directory is focused on future contributions, where the calculus may be different.
Ok. I'll still be at the mai-tai meeting: this has clearly used a bunch of people's time over email, and it is worth ten or fifteen minutes to make sure it isn't still oscillating.
We discussed for the the whole 30-min mai-tai meeting today. If I correctly understood what @stellaraccident said, the most significant bit was that we should talk less in terms of directories and more in terms of branches. A git branch allows you to do whatever you want without any extra boilerplate ( / build system / pluggability) as you're effectively working directly in the core code. Then if the experiment is successful, the git branch can be pulled into the main line.
Thanks, Benoit! Just to make sure we are all on the same page... We agreed that branches / downstream forks should be used for cases where further experimentation or data gathering is needed to make a case for the upstream landing. We also mentioned that ukernels that contribute to mainstream targets should live in IREE upstream when they reach the production quality needed. If they are inspired by ukernels in other projects, we should make sure that the license of those projects is compatible with IREE and an attribution note is added, but they should leave upstream.
Regarding Marie's PR, it was pointed out that the ukernel is production ready and should land upstream. If further ukernels are under development, we should follow the branch approach until they reach similar level of production readiness.
Is my understanding correct?
Thanks, Benoit! Just to make sure we are all on the same page... We agreed that branches / downstream forks should be used for cases where further experimentation or data gathering is needed to make a case for the upstream landing. We also mentioned that ukernels that contribute to mainstream targets should live in IREE upstream when they reach the production quality needed. If they are inspired by ukernels in other projects, we should make sure that the license of those projects is compatible with IREE and an attribution note is added, but they should leave upstream.
Regarding Marie's PR, it was pointed out that the ukernel is production ready and should land upstream. If further ukernels are under development, we should follow the branch approach until they reach similar level of production readiness.
Is my understanding correct?
Yes. My main point was more guidance: don't check speculative code into upstream and definitely don't add entire deps to evaluate experiments. Beyond that, the project wants the fastest things in tree for out of the box use on mainline platforms and we are all actively supportive of steps to get there.
It is fine to do experiments to inform direction of course, but that is what branches are for. It is also fine to go to the trouble of having various experimental or v1 vs v2 directories, but in general we add such things with the assumption that the code is production grade in form but needs iteration and collaboration to finish without disrupting what is there.
I feel like this is a lot of words for just general guidelines on how to make progress. When I sit down to do something, I almost always have a strong hypothesis at the outset of what form the final result will take if successful. If I've got big error bars on that, I do some prototypes and don't check those in (or put them in a side repo, etc). That is doubly true of adding deps or things that impact project structure. You want to make sure that what is being done will persist before investing in expensive mechanincs.
Sometimes we get these things wrong, and that is fine too. In any case, I think a) yes, this PR should land, and b) more/better/faster stuff is good and desirable in tree, but if it requires additional deps or a lot of structural movement, we want to make sure we want to be doing that before going for such consensus.
In spirit of having fastest code paths on by default, we should reconsider turning on microkernel usage for mmt4d on by default. We've kicked that bucket far enough and I don't see any reason for not pushing on making use ukernels for mmt4d the default.
I forget how things fit together and what our default flag state is but we're currently real bad on several metrics (peak memory consumption and utilization) with data tiling/mmt4d - though if data tiling is currently on by default enabling mmt4d ukernels shouldn't make those much worse so it's probably fine and if data tiling is on enabling mmt4d is mostly a win (beyond the current distribution issues that Benoit is going to be looking at). I strongly agree we should push for turning this all on by (both data tiling/mmt4d) default and focusing on clearing the big issues (pack/unpack distribution/propagation/fusion, and how consteval works in a data tiling world). We're close to good here :)
I've filed https://github.com/openxla/iree/issues/16401 to track blockers before we can enable the mmt4d ukernel by default on llvm-cpu.