tpp-mlir
Performance variation in single thread benchmark execution
Need to profile what's going on here. 99% of the time is spent in libxsmm calls, so why the large variation, and why is the compiler "faster" on Zen but "slower" on Lake?
These numbers are consistent across multiple runs on our cluster, on AWS virtual instances, and on AWS bare metal.
ZEN3 | DNN | TPP-MLIR | Delta (TPP-MLIR / DNN) |
---|---|---|---|
FP32 | 105.2 | 113.0 | 107% |
BF16 | 91.5 | 92.6 | 101% |
MLP32 | 105.6 | 112.3 | 106% |
MLP16 | 92.0 | 93.1 | 101% |

CLX | DNN | TPP-MLIR | Delta (TPP-MLIR / DNN) |
---|---|---|---|
FP32 | 172.8 | 165.3 | 96% |
BF16 | 131.9 | 131.5 | 100% |
MLP32 | 172.2 | 164.8 | 96% |
MLP16 | 131.5 | 131.4 | 100% |
BF16 on SPR:
Some ideas:
- Libxsmm-dnn uses `mmap` to allocate temporary buffers on 2M page boundaries, while the LLVM JITter probably doesn't (see the sketch after this list).
- This could explain the ICX 4% slowdown, but not the Zen3 7% speedup.
- Maybe Zen3 doesn't work well with that practice?
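
For reference, here is a minimal sketch of the kind of 2 MiB-boundary allocation the first bullet refers to. This is not libxsmm-dnn's actual code; `alloc_2m_aligned` is a hypothetical helper that over-allocates with `mmap`, rounds the start address up to a 2 MiB boundary, and hints the kernel towards transparent huge pages:

```cpp
// Sketch only: roughly what "allocate temporary buffers on 2M page boundaries"
// can look like. A real allocator would also keep `raw`/`total` for munmap.
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

static void* alloc_2m_aligned(std::size_t size) {
  const std::size_t align = 2u * 1024u * 1024u;  // 2 MiB
  std::size_t total = size + align;              // over-allocate so we can round up
  void* raw = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED) return nullptr;
  std::uintptr_t aligned =
      (reinterpret_cast<std::uintptr_t>(raw) + align - 1) & ~(align - 1);
  // Hint that this region is a good candidate for transparent huge pages.
  madvise(reinterpret_cast<void*>(aligned), size, MADV_HUGEPAGE);
  return reinterpret_cast<void*>(aligned);
}

int main() {
  void* buf = alloc_2m_aligned(1u << 20);  // e.g. a 1 MiB scratch buffer
  std::printf("scratch at %p, 2M-aligned: %d\n", buf,
              buf && (reinterpret_cast<std::uintptr_t>(buf) % (2u * 1024u * 1024u)) == 0);
  return 0;
}
```

A plain `malloc` typically only guarantees 16-byte alignment, so whether such a buffer ends up backed by huge pages is left entirely to the kernel.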
#895 is the same problem, let's merge the issues.
I have debugged it to the following extent:
- it's only a single-thread issue, and only for very small problem sizes
- when changing libxsmm-dnn to use `malloc` instead of `libxsmm_aligned_malloc`, the gap gets much smaller (libxsmm-dnn performance drops); see the sketch after this list
- For larger sizes, e.g. C=K=2048 or minibatch=1024, the single-thread performance of libxsmm-dnn and tpp-mlir is identical.
--> "solution" let's run benchmarks on some slightly larger problem sizes, where data is large than 2M pages etc.
We can also try the `alignment` attribute on the memref alloc. It doesn't hurt.
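
To verify whether a given allocation path (the `alignment` attribute on the memref allocation, `libxsmm_aligned_malloc`, or plain `malloc`) actually puts a buffer on a 2 MiB boundary, a small diagnostic like this could be dropped into the benchmark harness; `is_2m_aligned` is a hypothetical helper, not part of tpp-mlir:

```cpp
// Quick check: does a pointer sit on a 2 MiB boundary?
#include <cstdint>
#include <cstdio>
#include <cstdlib>

static bool is_2m_aligned(const void* p) {
  const std::uintptr_t align = 2u * 1024u * 1024u;
  return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}

int main() {
  void* p = std::malloc(1u << 20);
  std::printf("malloc'd buffer 2M-aligned: %s\n", is_2m_aligned(p) ? "yes" : "no");
  std::free(p);
  return 0;
}
```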