AVX512 perf improvements
This PR brings a few things to the avx512 linalg kernels added in #989:
- Kernels are now auto-generated based on mr/nr from the build.rs script
- Wider kernels are available
- Fixed: the benchmarks were no longer taking the cold cache into account. It turns out `(0..10000000).collect::<Vec<_>>()` now gets optimized out completely by LLVM, meaning all my cold-cache results were wrong :)
- Added a benchmark that tests every avx512 kernel size, and a Python script that generates a CSV with the cache-cold throughputs for all the kernels
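An illustrative sketch (not the actual benchmark code) of the dead-code-elimination pitfall described above: LLVM may remove a computation whose result is never observed, which silently turns a cache-trashing step into a no-op. `std::hint::black_box` forces the compiler to treat the value as used.

```rust
use std::hint::black_box;

fn main() {
    // Without black_box, this allocation and fill can be optimized out
    // entirely when the vector is never read afterwards.
    let cache_trasher: Vec<u64> = (0..10_000_000).collect();

    // black_box keeps the buffer "live", so it really does pollute the
    // cache before the timed kernel run.
    let v = black_box(cache_trasher);
    assert_eq!(v.len(), 10_000_000);
}
```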
- Kernel selection has been updated with the wider kernels and to account for the cold cache results
- Various improvements on the assembly code - scatter gather ops
Kernel results
Results are in Gelem/s, from a cold start (the cache is trashed beforehand), with M=1024, K=1000.
Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz
Future work
I'm pretty much done on the asm kernel part; I don't think I'll be able to squeeze out any more perf there.
On a slightly higher level, border tile handling is still suboptimal: there is perf left on the table when N is low.
As you can see from the graph:
[graph: throughput (Gelem/s) vs N]
Between N=14 and N=15, we have a big drop, from 95 Gelem/s with the 32x14 kernel down to 63 Gelem/s with the 32x8 kernel.
This is because, when N=15, the x14 kernel would compute 13 useless elements in the border tile of the C matrix.
This gets better as N gets bigger, as the waste ratio is lower.
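A back-of-envelope check of the border-tile waste described above (illustrative only; the kernel widths are taken from the discussion):

```rust
// Fraction of computed columns that are wasted when an nr-wide kernel
// tiles N columns: we always compute ceil(N/nr) * nr columns.
fn waste_ratio(n: usize, nr: usize) -> f64 {
    let tiles = (n + nr - 1) / nr;  // number of nr-wide column tiles
    let computed = tiles * nr;      // columns actually computed
    (computed - n) as f64 / computed as f64
}

fn main() {
    // With the x14 kernel at N=15 we compute ceil(15/14)*14 = 28 columns,
    // 13 of which are useless: ~46% waste.
    assert_eq!(((15 + 14 - 1) / 14) * 14 - 15, 13);
    println!("N=15,   nr=14: {:.0}% waste", 100.0 * waste_ratio(15, 14));

    // The ratio shrinks as N grows: at N=1023 it is only ~1.3%.
    println!("N=1023, nr=14: {:.1}% waste", 100.0 * waste_ratio(1023, 14));
}
```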
[graph: throughput (Gelem/s) vs N]
On a higher level, @kali is working on collapsing consecutive dimensions of input matrices in matmuls. I think this may fix why I'm still getting a runtime of 6s instead of 300ms on my transformer models, so I don't think I'll start bothering with the border tiles just yet.
Other than that, I don't think we're that far from being on par with onnxruntime perf! :)
Awesome work again @cchudant :)
I don't have much in terms of setup for double-checking Intel implementations... So... @tgolsson, would you like to do a review here? Even better, could you check on some of your use-cases whether this has an unfavorable impact?
Awesome! Yes; I'm quite busy for the rest of the day, but I can look tomorrow! I haven't had time to test the AVX512 changes yet, so I'll try to do base->AVX512->this deltas!
FYI, I started looking at benchmarks, but since we hadn't bumped from version 0.17 I've got a bit of work to do. Symbolication/batch dimensions have changed a bit, so it's crashing during concretization. I'll do a review after lunch at least.
In case you have a thought @kali this is the issue:
[/home/ts/Repositories/cervo/crates/cervo-core/src/inferer/helpers.rs:38] &model.symbol_table = unk__16819 unk__16820 unk__16821 unk__16822 N
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed analyse for node #65 "model_4_concatenate_concat" InferenceConcat
Caused by:
0: Infering facts
1: Applying rule inputs[0].shape[0] == inputs[1].shape[0]
2: Impossible to unify Sym(unk__16820) with Sym(unk__16819).', src/main.rs:66:62
Happens here on this upgrade PR: https://github.com/EmbarkStudios/cervo/pull/44/files#diff-4a7fb40e77a22ccba64ba761a0c31ab388127a6309b79e1c7832602c1755d3dcR39. If you want to try the code:
$ cd benchmarks/perf-test
$ cargo run batch-scaling 100 "/tmp/dump" -o ../../brains/test-large.onnx -b `seq -s, 1 24`
Will try batch-sizes 1..24 of all the various batching mechanisms cervo has.
It looks to me like a typical case of overspecification of inputs and outputs in ONNX. (Prior to 0.19, tract was ignoring them; this was a bug. Now it's trying to make sense of them.) Try cancelling the output_shapes: call .with_output_facts(0, Default::default()) immediately after loading the ONNX file, before it gets analysed.
OK; now I've managed to get this to run. I've compared my current production (0.17) to 0.19.2[^1], main, and this branch. You can see this below. This is a conv-stack of 32x32 followed by an MLP [1024,1024]. This branch is definitely an improvement. But there are some weird results compared to avx256, like getting much worse in the 6-wide case and then much better in the 7-wide case? Maybe I'll see the reason why in the code.

[^1]: It looks like something's fishy with the versioning - main is on 0.19.3-pre, but the latest release is 0.19.7? Not sure what to compare against. But I'm hoping there's not a huge difference between 0.19.2 and whatever is used here right now.
Starting review now!
Ah, that's kinda bad. My kernels are actually slower than avx256? Is there something wrong with my methodology, then?
Just ignore the pre-version tag on main. main will become 0.20. There is a maintenance branch for 0.19 called 0.19.x
Out of curiosity, have you run this under anything like VTune or Advisor? I'm curious what they'd say and identify as concerns.
Yes, VTune. Obviously there's something I'm missing here, though, so I'll get back on this.
Do you have a script so that I can try and reproduce your results?
Clone cervo on the ts/tract-0193 branch. You'll need to edit the root Cargo.toml and benchmarks/perf-test/Cargo.toml to override to whichever git you want to test with. To use crates.io you'll also need to change the pinned versions in all the other Cargo.toml-s to something like =0.19.2, since 0.19.3-pre isn't on crates.io.
Then you can run it like this:
$ cd benchmarks/perf-test
$ cargo run batch-scaling 10000 "some-path/some-name.csv" -o ../../brains/test-large.onnx -b `seq -s, 1 24`
Once you've tested a bunch of PRs you can do
$ python3 ../python/compare_batchsize.py \
some-path/first.csv,some-path/second.csv,... \
test1label,test2label,... \
10000 \
some-path/output.png
The script should handle any number of comparisons; they're all relative to the first file. E.g., mine was:
python3 ../python/compare_batchsize.py \
../out/batchsize-avx256-017.csv,../out/batchsize-avx256.csv,../out/batchsize-avx512.csv,../out/batchsize-avx512-imp.csv \
avx256-0.17,avx256-0.19.2,avx512,avx512b \
10000 \
../out/batchsize-compare-dynamic.png
Nowhere near as pretty but, more importantly, with worse values than yours:

Okay, I did not have much time to look at this today, but I managed to replicate the regression locally, and looking at VTune, apparently I'm spending way too much time in the MMV kernels, it looks like?
I was very worried that this was something I wouldn't be able to replicate, seeing that your kernel times are so different. Fortunately, it seems that's not the case.
I'll look further tomorrow, but something is fishy here and it's my fault :)
How easy would it be to remove loop unrolling? We could be cramping the instruction decoder/cache.
Very easy, it's just setting the loop unroll variable to 1 and commenting out the jumps. That's a good point, I'll test that when I get back to it.
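For readers following along, a minimal sketch of what "setting the unroll variable to 1" means. This is not the actual asm kernel code, just a scalar dot product with a compile-time UNROLL factor: setting it to 1 recovers the plain loop, while larger values trade fewer branch/decode operations per element against more instruction-cache pressure.

```rust
// Compile-time unroll factor; UNROLL = 1 disables unrolling.
const UNROLL: usize = 4;

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = [0.0f32; UNROLL];
    let main = n - n % UNROLL;
    let mut i = 0;
    while i < main {
        // One "iteration" performs UNROLL multiply-adds into separate
        // accumulators, shortening the dependency chain.
        for u in 0..UNROLL {
            acc[u] += a[i + u] * b[i + u];
        }
        i += UNROLL;
    }
    let mut sum: f32 = acc.iter().sum();
    for j in main..n {
        sum += a[j] * b[j]; // scalar tail for the remainder
    }
    sum
}

fn main() {
    let a = vec![1.0f32; 10];
    let b = vec![2.0f32; 10];
    assert_eq!(dot(&a, &b), 20.0);
}
```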
Hey folks, I'm planning on cutting 0.20.x in a week or so. Do we have a path towards making these optimisations a part of it?