Ts/avx 512f tmp
Re: #787, poke @kali
This is a dump of my AVX512f branch with kernels and some notes. I'll be cleaning this up a bit more before I'd consider it ready for someone else to use; it's just a bit easier to track in PR format than by scrolling through all the diffs. A few notes:
- I've found both AVX256 and AVX512 to be very sensitive to memory bandwidth for narrow kernel sizes & large weights. A single-lane kernel of AVX512 runs at exactly the same speed as AVX256 for a lot of the workloads I deal with (lots of 1024-sized matrices, commonly larger than L2), and you're not getting huge benefits even for e.g. 6-wide kernels at those matrix sizes. At a 256-sized matrix (the smallest I regularly use 🤡) you have enough iop cache/lookahead that it works out better - the prefetcher can just prefetch all the way without struggling.
- The benchmark files (both the high- and low-level ones) I've treated as very much "editable exploration" rather than fire-and-forget benchmarks. Mostly related to the memory bandwidth point above.
- If you see very good results with Direct Memory Operands, you've broken the benchmarks. The CPU will elide DMO loads if it can (using private registers etc.), whereas an explicit load will always occur. This is why e.g. the x86_64 bench tries to emulate memory striding.
- As a follow-up, I definitely suggest following the same pattern as for ARM with a kernel-selection-network. It might even turn out that on CPUs with heavy AVX512 downclocking, running the "theoretically" equally performant AVX256 kernel is faster.
- There are some funky MMM kernels that seemed to provide some benefit during testing, such as linalg/x86_64/avx512/1x1/packed_packed_loop1/unroll-16.tmpli. This essentially trades off a bunch of loads for shuffles/rotates (see the sketch after this list)... not sure if it's always valid, and it likely depends very much on microarch/port allocation.
- The way I've done this so far is essentially duplicating the FMA directory into the AVX512 one and then cleaning up there. I think it's worked out OK, but it obviously makes the PR massive. There are some remnants in here from where I tried to parametrize FMA, but it became a pain to manage the templates. I'll remove those.
- This is a WIP that I was hoping to finish during the summer, but I got pulled into other higher-level code. Feel free to swear at how messy it is ;-)
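For the unroll-16 kernel mentioned above, the rough idea is something like the sketch below. This is an illustration only, not the actual .tmpli code: the function name, the 16-wide unroll and the packing layout are just assumptions for the example, and it needs an AVX-512F-enabled Rust toolchain.

```rust
// One unrolled block of a 1x1 (one zmm of output, one column) packed*packed loop.
// Conventionally each of the 16 k-steps would do a broadcast load of one B value:
//     let b_i = _mm512_set1_ps(*b.add(i));
// i.e. 16 extra loads per block. The variant below does a single 16-wide load of B
// and broadcasts each lane in registers with a permute instead.
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
unsafe fn mmv_block_of_16(
    a: *const f32,
    b: *const f32,
    mut acc: core::arch::x86_64::__m512,
) -> core::arch::x86_64::__m512 {
    use core::arch::x86_64::*;

    let b_vec = _mm512_loadu_ps(b); // 16 consecutive packed B values, one load
    for i in 0..16 {
        let lane = _mm512_set1_epi32(i as i32); // index vector selecting lane i
        let b_i = _mm512_permutexvar_ps(lane, b_vec); // in-register broadcast of b[i]
        let a_col = _mm512_loadu_ps(a.add(16 * i)); // A still streams from memory
        acc = _mm512_fmadd_ps(a_col, b_i, acc);
    }
    acc
}
```

Whether the permutes are cheaper than the broadcast loads will depend on which ports they land on, hence the microarch caveat.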
@kali Done with the cleanup, I think. What's left to do is mostly to build the top-layer kernels and validate the tanh/sigmoid ops that I haven't checked.
Some pointers to what I was looking into further in this work:
- Making more of the kernel parts into templates - the store part definitely, and some others might work too.
- I've moved the A,B striding into the packed-packed impl. This works better during testing as it avoids the DMO load-elision issue mentioned above.
- As mentioned above, auto-kernel selection. My scheme won't scale. :(
- Wider kernels for AVX-512 to achieve higher arithmetic intensity (see the back-of-the-envelope numbers after this list).
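On the arithmetic intensity point, a rough back-of-the-envelope (ignoring C traffic and caches, so take it as an estimate rather than a measurement): an `mr x nr` f32 kernel does `2*mr*nr` FLOPs per k-step while streaming `4*(mr+nr)` bytes of packed A and B, so intensity is about `mr*nr / (2*(mr+nr))` FLOP/byte - roughly 0.5 for 16x1, 2.2 for 16x6 and 3.4 for 16x12. That's why the single-lane kernel ends up memory-bound as soon as the weights fall out of L2.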
Edit: also, I'm going on vacation for two weeks so I might be a bit worse at responding due to traveling (even though this work has been mostly free-time so far...). I'll happily help out with testing etc once I get back! And depending on your timeline, I was planning to pick up this work again after the vacations so might drive it home if you haven't got to it yet.
Hey @tgolsson, thanks for making this available.
I fixed a couple of correctness issues here and there, and added a simple heuristic to pick an avx512f mmv.
But I could not find the top-level files for the mmm (non mmv) avx512f kernels. By any chance, did you write them and forget to add them?
Intel did a great job segmenting the AVX512F landscape like crazy: some chips have two FMA lanes, some have just one. So going beyond a "naive" kernel may be a very messy business, requiring benching on a dozen different processors. And one thing I am struggling with in this particular work is... I don't have any target. I don't own any avx512f-able computer that I can use to run benches. And Sonos has no direct interest in x64, so I'm pretty much on my own.

So as a first milestone, I would like a chip-naive mmv (done) and a chip-naive square-ish (16x12) mmm kernel online. The square-ish kernel is what image classifiers use, and people are evaluating on these models, so I hope this would help close the performance gap between tract and onnxrt (and others). I know the square-ish kernel is not good for the small-batch-DNN use cases you're after; I'm happy to write that one myself, of course. We can plug in the skinny kernels pretty quickly if you find them helpful, then move on to the weird chip-dependent optimisations.
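For reference, the shape I have in mind for the chip-naive 16x12 is roughly the following. This is just a sketch of the register budget, not code from this branch: a 16x12 f32 tile of C is 12 zmm accumulators, which leaves plenty of the 32 zmm registers for the A column, the B broadcasts and scratch.

```rust
// Rough shape of a chip-naive 16x12 f32 packed*packed micro-kernel (illustration only).
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
unsafe fn kernel_16x12(
    mut a: *const f32,
    mut b: *const f32,
    k: usize,
) -> [core::arch::x86_64::__m512; 12] {
    use core::arch::x86_64::*;

    let mut acc = [_mm512_setzero_ps(); 12]; // the 16x12 C tile, one zmm per column
    for _ in 0..k {
        let a_col = _mm512_loadu_ps(a); // 16 rows of packed A, one zmm
        for j in 0..12 {
            let b_j = _mm512_set1_ps(*b.add(j)); // broadcast one packed B value
            acc[j] = _mm512_fmadd_ps(a_col, b_j, acc[j]); // one FMA per column
        }
        a = a.add(16);
        b = b.add(12);
    }
    acc
}
```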
I'm actually happy to merge quite early, even without any operational mmm kernel. I don't mind having dozens of unplugged candidate kernel loops lying around. We should have a bench-driven approach here, and this documents what we've tried and what did not work so far.
I might've forgotten a stash or two, but I did not have a complete set of kernels. I found the workflow for the FMA kernels quite frustrating when I added all the variations, so one thing I want(ed) to do was to parametrize and use more templates. Fundamentally, I believe most of those can be reduced to a bunch of loops and a template that takes a register width/height. Maybe with some overrides for special kernel approaches such as the shuffle/rotate one I linked. I hadn't quite gotten that far, but that was my plan once I'd ironed out the details of the MMV and 1-2 MMM cases.
Re: AVX-512f dev, I agree completely. I found a great resource on AVX-512f; my workstation and, as I recall, most recent Xeon chips should have dual-lane FMA, but the consumer/prosumer space is much more fragmented indeed. I figured from your phrasing in #787 that another team at Sonos was interested in this work, but I take it this is then more of a tract/OSS user? If there are details in here that you can't share publicly, feel free to shoot me an email. I'd love to see what we can do together to replace our ORT usage with tract across all our services - but that'd also potentially require inter-op or intra-op parallelism for latency reasons. Maybe I can even put some "official" time on that... :-)
I agree with the bench-driven approach indeed - I've done a lot of work using the Intel oneAPI suite and modifying the benches for various cases. Sadly it's quite frustrating to work with when one normally develops on Linux. However, I haven't seen any significant avenues for perf improvement there - for most large weight matrices the tall-skinny case is just stalling on memory, and there are no easy ways to scale arithmetic intensity there outside the hyper-unrolled cases with rotates etc. that I did.
Improving the bench scripts to sweep many different cases would definitely be good - I'm hyper-optimizing for my typical 1024x1024 and 256x256 weight matrices and hoping it generalizes well enough to other workloads. To be fair, I do think that is better than the original one, which didn't account for cache/memory pressure, but yeah... not optimal for either case.
Either way, I think I can commit to filling in some more kernels etc once I get back from vacation, so if you want to merge now and wait for that I'm OK with it. Or wait 2 weeks and we can do the full house then. :-)
> I might've forgotten a stash or two, but I did not have a complete set of kernels. I found the workflow for the FMA kernels quite frustrating when I added all the variations, so one thing I want(ed) to do was to parametrize and use more templates. Fundamentally, I believe most of those can be reduced to a bunch of loops and a template that takes a register width/height. Maybe with some overrides for special kernel approaches such as the shuffle/rotate one I linked. I hadn't quite gotten that far, but that was my plan once I'd ironed out the details of the MMV and 1-2 MMM cases.
I agree most of it could be generated from templates. Some projects actually generate and assemble kernels JIT... Not sure if we want to go that far, but it's true that the current approach duplicates the top-level files and, with the multiplication of kernels, it does not scale. Overrides for special approaches are a must if we want to be able to use the same pattern for ARM. All of Sonos' current devices ship in-order cores; they required a painfully iterative process of trial and error to get the instruction order right...
> Re: AVX-512f dev, I agree completely. I found a great resource on AVX-512f; my workstation and, as I recall, most recent Xeon chips should have dual-lane FMA, but the consumer/prosumer space is much more fragmented indeed. I figured from your phrasing in #787 that another team at Sonos was interested in this work, but I take it this is then more of a tract/OSS user? If there are details in here that you can't share publicly, feel free to shoot me an email. I'd love to see what we can do together to replace our ORT usage with tract across all our services - but that'd also potentially require inter-op or intra-op parallelism for latency reasons. Maybe I can even put some "official" time on that... :-)
Sorry, that comment was misleading. It's an out-of-Sonos team. I'll ping them offline and see if they want to take part in the conversation and more. Intra-op parallelism (for mmm) has been experimented with before, see https://github.com/sonos/tract/discussions/690 . Results were a bit disappointing, but I think we have made the im2col better since, so it may be better now.
> I agree with the bench-driven approach indeed - I've done a lot of work using the Intel oneAPI suite and modifying the benches for various cases. Sadly it's quite frustrating to work with when one normally develops on Linux. However, I haven't seen any significant avenues for perf improvement there - for most large weight matrices the tall-skinny case is just stalling on memory, and there are no easy ways to scale arithmetic intensity there outside the hyper-unrolled cases with rotates etc. that I did.
Yep, that's to be expected. I suspect the chip designers actually use the square GEMM case (among others) to arbitrate between arithmetic and memory bandwidth, so the skinny cases move away from the sweet spot. I would be surprised if we could do much better than generating simple assembly and letting the chip do its thing.
> Improving the bench scripts to sweep many different cases would definitely be good - I'm hyper-optimizing for my typical 1024x1024 and 256x256 weight matrices and hoping it generalizes well enough to other workloads. To be fair, I do think that is better than the original one, which didn't account for cache/memory pressure, but yeah... not optimal for either case.
Well, I think there is value in trying to bench the arithmetic problem separately. On in-order ARM it was vital for fiddling with operation order, specifically because on these chips the cache is non-deterministic. But in the end we don't just want a fast kernel, what we want is a fast multiplier. Ideally, we should have benches at every level where we introduce new complexity. In my current performance "model", there is a gap between the loop performance and the multiplier performance that I am not able to explain fully, so I'm glad to grab more information wherever we can find it.
One thing that tract does not do, and that I think will be worth going into as we are speaking about bigger products, is the "7-loop" thing. tract multiplies over 5 loops: from outer to inner, m/mr and n/nr in Rust, then k, mr, nr in the kernel. To improve memory efficiency, GotoBLAS and more recently BLIS have documented well that introducing two more outer loops (one over k, one over m or n) is beneficial for cache efficiency. This is not something I have explored, but the kernel interface as designed should already support it, so it would be a Rust-side thing to implement (see the sketch below).
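To make that concrete, the comparison is roughly this. Names and blocking constants are illustrative, not existing tract code; `kernel` stands in for the asm micro-kernel that runs the k, mr, nr loops and accumulates into C.

```rust
// Illustrative blocking constants and a stub standing in for the asm micro-kernel.
const MR: usize = 16;
const NR: usize = 12;
const MC: usize = 256;
const KC: usize = 256;

fn kernel(_i: usize, _j: usize, _k0: usize, _kc: usize) {
    // packed*packed micro-kernel: loops over k, mr, nr and accumulates into C
}

// Current 5-loop shape: two Rust loops around the kernel.
fn mmm_5_loops(m: usize, n: usize, k: usize) {
    for i in (0..m).step_by(MR) {       // m/mr
        for j in (0..n).step_by(NR) {   // n/nr
            kernel(i, j, 0, k);         // k, mr, nr inside the kernel
        }
    }
}

// GotoBLAS/BLIS-style shape: two extra outer loops (over k and over m here) so a
// KC-deep panel of B and an MC x KC block of A stay cache-resident while reused.
fn mmm_7_loops(m: usize, n: usize, k: usize) {
    for kk in (0..k).step_by(KC) {                        // extra loop over k
        let kc = KC.min(k - kk);
        for ii in (0..m).step_by(MC) {                    // extra loop over m
            for i in (ii..(ii + MC).min(m)).step_by(MR) { // m/mr
                for j in (0..n).step_by(NR) {             // n/nr
                    kernel(i, j, kk, kc); // accumulates across the k blocks
                }
            }
        }
    }
}
```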
> Either way, I think I can commit to filling in some more kernels etc once I get back from vacation, so if you want to merge now and wait for that I'm OK with it. Or wait 2 weeks and we can do the full house then. :-)
I may just wait. I'm on vacation too, and will be AFK most of next week, so...
Hi @kali @tgolsson, I work at Mithril Security and I'd like to continue the work started on this branch. Is there anything I need to know that is not already on this PR?
Hey @feldspath!
I don't think there's more than what's on this branch, at least from my perspective.
I should have a bunch of WIP stuff to generate kernels somewhere (which I started after the vacation but never finished), but I never got it to work and it isn't in a state where it makes sense for someone else to try to fix it. As I recall, the way the kernels use templating is deprecated due to lack of scoping, so I ended up having to redo all of that as well and it just ballooned in complexity. I do think the approach is still worth exploring, maybe with a proper script to generate the variants instead of handlebars (something like the sketch below). Doing it in one go just became too much for the time I could afford to spend on it.
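For the record, what I mean by a script is roughly something like this. It is purely illustrative: the register allocation and operand layout here are made up and do not follow the existing .tmpli conventions.

```rust
// Emit the unrolled FMA body of a (zmm_rows*16) x nr f32 tile as text.
// Assumed layout: accumulators in zmm0.., A column pre-loaded in zmm24..,
// B broadcast through zmm28, packed B pointer in rbx.
fn emit_packed_packed_body(zmm_rows: usize, nr: usize) -> String {
    let mut out = String::new();
    for j in 0..nr {
        // broadcast the j-th packed B value for this k step
        out.push_str(&format!("vbroadcastss zmm28, dword ptr [rbx + {}]\n", 4 * j));
        for r in 0..zmm_rows {
            // acc[j][r] += a[r] * b[j]
            out.push_str(&format!(
                "vfmadd231ps zmm{}, zmm{}, zmm28\n",
                j * zmm_rows + r,
                24 + r
            ));
        }
    }
    out
}

fn main() {
    // e.g. the 16x12 tile: one zmm row (16 floats) by 12 columns
    print!("{}", emit_packed_packed_body(1, 12));
}
```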