tpp-mlir
Performance variation in single thread benchmark execution
Need to profile what's going on here. 99% of the time is spent in libxsmm calls, so why the large variation, and why is the compiler "faster" on Zen but "slower" on Lake?
These numbers are consistent across multiple runs on our cluster, on AWS virtual instances, and on AWS bare metal.
ZEN3 | DNN | TPP-MLIR | Delta (TPP-MLIR / DNN) |
---|---|---|---|
FP32 | 105.2 | 113.0 | 107% |
BF16 | 91.5 | 92.6 | 101% |
MLP32 | 105.6 | 112.3 | 106% |
MLP16 | 92.0 | 93.1 | 101% |

CLX | DNN | TPP-MLIR | Delta (TPP-MLIR / DNN) |
---|---|---|---|
FP32 | 172.8 | 165.3 | 96% |
BF16 | 131.9 | 131.5 | 100% |
MLP32 | 172.2 | 164.8 | 96% |
MLP16 | 131.5 | 131.4 | 100% |
BF16 on SPR:
Some ideas:
- Libxsmm-dnn uses `mmap` to allocate temporary buffers on 2M page boundaries, while the LLVM JITter probably doesn't (see the sketch after this list).
- This could explain the ICX 4% slowdown, but not the Zen3 7% speedup.
- Maybe Zen3 doesn't work well with that practice?
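
For reference, here is a minimal sketch of the kind of 2 MiB-boundary allocation the first bullet refers to. This is not libxsmm-dnn's actual code; `alloc_2m_aligned` is a hypothetical helper that over-allocates with `mmap`, rounds the start address up to a 2 MiB boundary, and hints the kernel towards transparent huge pages:

```cpp
// Sketch only: roughly what "allocate temporary buffers on 2M page boundaries"
// can look like. A real allocator would also keep `raw`/`total` for munmap.
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

static void* alloc_2m_aligned(std::size_t size) {
  const std::size_t align = 2u * 1024u * 1024u;  // 2 MiB
  std::size_t total = size + align;              // over-allocate so we can round up
  void* raw = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED) return nullptr;
  std::uintptr_t aligned =
      (reinterpret_cast<std::uintptr_t>(raw) + align - 1) & ~(align - 1);
  // Hint that this region is a good candidate for transparent huge pages.
  madvise(reinterpret_cast<void*>(aligned), size, MADV_HUGEPAGE);
  return reinterpret_cast<void*>(aligned);
}

int main() {
  void* buf = alloc_2m_aligned(1u << 20);  // e.g. a 1 MiB scratch buffer
  std::printf("scratch at %p, 2M-aligned: %d\n", buf,
              buf && (reinterpret_cast<std::uintptr_t>(buf) % (2u * 1024u * 1024u)) == 0);
  return 0;
}
```

A plain `malloc` typically only guarantees 16-byte alignment, so whether such a buffer ends up backed by huge pages is left entirely to the kernel.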
#895 is the same problem, let's merge the issues.
I have debugged it to the following extent:
- it's only a single-thread issue, and only for very small problem sizes
- when changing libxsmm-dnn to use `malloc` instead of `libxsmm_aligned_malloc`, the gap gets much smaller (libxsmm-dnn performance drops); see the sketch after this list
- For larger sizes, e.g. C=K=2048 or minibatch=1024, the single-thread performance of libxsmm-dnn and tpp-mlir is identical.
--> "solution" let's run benchmarks on some slightly larger problem sizes, where data is large than 2M pages etc.
We can also try the `alignment` attribute on the memref alloc. It doesn't hurt.
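
To verify whether a given allocation path (the `alignment` attribute on the memref allocation, `libxsmm_aligned_malloc`, or plain `malloc`) actually puts a buffer on a 2 MiB boundary, a small diagnostic like this could be dropped into the benchmark harness; `is_2m_aligned` is a hypothetical helper, not part of tpp-mlir:

```cpp
// Quick check: does a pointer sit on a 2 MiB boundary?
#include <cstdint>
#include <cstdio>
#include <cstdlib>

static bool is_2m_aligned(const void* p) {
  const std::uintptr_t align = 2u * 1024u * 1024u;
  return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}

int main() {
  void* p = std::malloc(1u << 20);
  std::printf("malloc'd buffer 2M-aligned: %s\n", is_2m_aligned(p) ? "yes" : "no");
  std::free(p);
  return 0;
}
```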