AVX512 perf improvements
This PR brings a few things to the avx512 linalg kernels added in #989:
- Kernels are now auto-generated based on mr/nr from the build.rs script
- Wider kernels are available
- Fixed: the benchmarks were no longer taking the cold cache into account. It turns out `(0..10000000).collect::<Vec<_>>()` now gets optimized out completely by LLVM, meaning all my cold-cache results were wrong :)
- Added a benchmark that tests every avx512 kernel size, and a Python script that generates a CSV with the cache-cold throughputs for all the kernels
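An illustrative sketch (not the actual benchmark code) of the dead-code-elimination pitfall described above: LLVM may remove a computation whose result is never observed, which silently turns a cache-trashing step into a no-op. `std::hint::black_box` forces the compiler to treat the value as used.

```rust
use std::hint::black_box;

fn main() {
    // Without black_box, this allocation and fill can be optimized out
    // entirely when the vector is never read afterwards.
    let cache_trasher: Vec<u64> = (0..10_000_000).collect();

    // black_box keeps the buffer "live", so it really does pollute the
    // cache before the timed kernel run.
    let v = black_box(cache_trasher);
    assert_eq!(v.len(), 10_000_000);
}
```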
- Kernel selection has been updated with the wider kernels and to account for the cold cache results
- Various improvements on the assembly code - scatter gather ops
Kernel results
Results are in Gelem/s, from a cold start (the cache is trashed beforehand), with M=1024, K=1000.
Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz
Future work
I'm pretty much done on the asm kernel part; I don't think I'll be able to squeeze out any more perf there.
On a slightly higher level, border tile handling is still suboptimal: there is perf left on the table when N is low.
As you can see from the graph:
[graph: throughput (Gelem/s) vs N]
Between N=14 and N=15, we have a big drop, from 95 Gelem/s with the 32x14 kernel down to 63 Gelem/s with the 32x8 kernel.
This is because, when N=15, the x14 kernel would compute 13 useless elements in the border tile of the C matrix.
This gets better as N gets bigger, as the waste ratio is lower.
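A back-of-envelope check of the border-tile waste described above (illustrative only; the kernel widths are taken from the discussion):

```rust
// Fraction of computed columns that are wasted when an nr-wide kernel
// tiles N columns: we always compute ceil(N/nr) * nr columns.
fn waste_ratio(n: usize, nr: usize) -> f64 {
    let tiles = (n + nr - 1) / nr;  // number of nr-wide column tiles
    let computed = tiles * nr;      // columns actually computed
    (computed - n) as f64 / computed as f64
}

fn main() {
    // With the x14 kernel at N=15 we compute ceil(15/14)*14 = 28 columns,
    // 13 of which are useless: ~46% waste.
    assert_eq!(((15 + 14 - 1) / 14) * 14 - 15, 13);
    println!("N=15,   nr=14: {:.0}% waste", 100.0 * waste_ratio(15, 14));

    // The ratio shrinks as N grows: at N=1023 it is only ~1.3%.
    println!("N=1023, nr=14: {:.1}% waste", 100.0 * waste_ratio(1023, 14));
}
```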
[graph: throughput (Gelem/s) vs N]
On a higher level, @kali is working on collapsing consecutive dimensions of input matrices in matmuls. I think this may fix why I'm still getting a runtime of 6s instead of 300ms on my transformer models, so I don't think I'll start bothering with the border tiles just yet.
Other than that, I don't think we're that far from being on par with onnxruntime perf! :)
Awesome work again @cchudant :)
I don't have much in terms of setup for double-checking Intel implementations... So... @tgolsson, would you like to do a review here? Even better, could you check on some of your use-cases whether this has an unfavorable impact?
Awesome! Yes; I'm quite busy for the rest of the day, but I can look tomorrow! I haven't had time to test the AVX512 changes yet, so I'll try to do base->AVX512->this deltas!
FYI, I started looking at benchmarks, but since we hadn't bumped from version 0.17 I've got a bit of work to do. Symbolication/batch dimensions have changed a bit, so it's crashing during concretization. I'll do a review after lunch at least.
In case you have a thought @kali this is the issue:
[/home/ts/Repositories/cervo/crates/cervo-core/src/inferer/helpers.rs:38] &model.symbol_table = unk__16819 unk__16820 unk__16821 unk__16822 N
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed analyse for node #65 "model_4_concatenate_concat" InferenceConcat
Caused by:
0: Infering facts
1: Applying rule inputs[0].shape[0] == inputs[1].shape[0]
2: Impossible to unify Sym(unk__16820) with Sym(unk__16819).', src/main.rs:66:62
Happens here on this upgrade PR: https://github.com/EmbarkStudios/cervo/pull/44/files#diff-4a7fb40e77a22ccba64ba761a0c31ab388127a6309b79e1c7832602c1755d3dcR39. If you want to try the code:
$ cd benchmarks/perf-test
$ cargo run batch-scaling 100 "/tmp/dump" -o ../../brains/test-large.onnx -b `seq -s, 1 24`
Will try batch-sizes 1..24 of all the various batching mechanisms cervo has.
It looks to me like a typical case of overspecification of inputs and outputs in ONNX. (Prior to 0.19, tract was ignoring them; this was a bug. Now it's trying to make sense of them.) Try cancelling the output_shapes: call .with_output_facts(0, Default::default()) immediately after loading the ONNX file, before it gets analysed.
OK; now I've managed to get this to run. I've compared my current production (0.17) to 0.19.2[^1], main, and this branch. You can see this below. This is a conv-stack of 32x32 followed by an MLP [1024,1024]. This branch is definitely an improvement. But there are some weird results compared to avx256, like getting much worse in the 6-wide case and then much better in the 7-wide case? Maybe I'll see the reason why in the code.

[^1]: It looks like something's fishy with the versioning - main is on 0.19.3-pre, but the latest release is 0.19.7? Not sure what to compare against. But I'm hoping there's not a huge difference between 0.19.2 and whatever is used here right now.
Starting review now!
Ah, that's kinda bad. My kernels are actually slower than avx256? Is there something wrong with my methodology, then?
Just ignore the pre-version tag on main. main will become 0.20. There is a maintenance branch for 0.19 called 0.19.x
Out of curiosity, have you run this under anything like VTune or Advisor? I'm curious what they'd say and identify as concerns.
Yes, VTune. Obviously there's something I'm missing here, though, so I'll get back on this.
Do you have a script so that I can try and reproduce your results?
Clone cervo on the ts/tract-0193 branch. You'll need to edit the root Cargo.toml and benchmarks/perf-test/Cargo.toml to override to whichever git you want to test with. To use crates.io you'll also need to change the pinned versions in all the other Cargo.toml-s to something like =0.19.2, since 0.19.3-pre isn't on crates.io.
Then you can run it like this:
$ cd benchmarks/perf-test
$ cargo run batch-scaling 10000 "some-path/some-name.csv" -o ../../brains/test-large.onnx -b `seq -s, 1 24`
Once you've tested a bunch of PRs you can do
$ python3 ../python/compare_batchsize.py \
some-path/first.csv,some-path/second.csv,... \
test1label,test2label,... \
10000 \
some-path/output.png
The script should handle any number of comparisons; they're all relative to the first file. E.g., mine was:
python3 ../python/compare_batchsize.py \
../out/batchsize-avx256-017.csv,../out/batchsize-avx256.csv,../out/batchsize-avx512.csv,../out/batchsize-avx512-imp.csv \
avx256-0.17,avx256-0.19.2,avx512,avx512b \
10000 \
../out/batchsize-compare-dynamic.png
Nowhere near as pretty but, more importantly, with worse values than yours:

Okay, I did not have much time to look at this today, but I managed to replicate the regression locally, and looking at VTune, apparently I'm spending way too much time in the MMV kernels, it looks like?
I was very worried that this was something I wouldn't be able to replicate, seeing that your kernel times are so different. Fortunately, it seems that's not the case.
I'll look further tomorrow, but something is fishy here and it's my fault :)
How easy would it be to remove loop unrolling? We could be cramping the instruction decoder/cache.
Very easy, it's just setting the loop unroll variable to 1 and commenting out the jumps. That's a good point, I'll test that when I get back to it.
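For readers following along, a minimal sketch of what "setting the unroll variable to 1" means. This is not the actual asm kernel code, just a scalar dot product with a compile-time UNROLL factor: setting it to 1 recovers the plain loop, while larger values trade fewer branch/decode operations per element against more instruction-cache pressure.

```rust
// Compile-time unroll factor; UNROLL = 1 disables unrolling.
const UNROLL: usize = 4;

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = [0.0f32; UNROLL];
    let main = n - n % UNROLL;
    let mut i = 0;
    while i < main {
        // One "iteration" performs UNROLL multiply-adds into separate
        // accumulators, shortening the dependency chain.
        for u in 0..UNROLL {
            acc[u] += a[i + u] * b[i + u];
        }
        i += UNROLL;
    }
    let mut sum: f32 = acc.iter().sum();
    for j in main..n {
        sum += a[j] * b[j]; // scalar tail for the remainder
    }
    sum
}

fn main() {
    let a = vec![1.0f32; 10];
    let b = vec![2.0f32; 10];
    assert_eq!(dot(&a, &b), 20.0);
}
```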
Hey folks, I'm planning on cutting 0.20.x in a week or so. Do we have a path towards making these optimisations a part of it?