Performance variation in single-thread benchmark execution

[Open] rengolin opened this issue 11 months ago • 3 comments

We need to profile what's going on here: 99% of the time is spent in libxsmm calls, so why the large variation, and why is the compiler "faster" on Zen but "slower" on Lake?

These numbers are consistent across multiple runs on our cluster, on AWS virtual instances, and on AWS metal.

ZEN3    DNN     TPP-MLIR   Delta
FP32    105.2   113.0      107%
BF16    91.5    92.6       101%
MLP32   105.6   112.3      106%
MLP16   92.0    93.1       101%

CLX     DNN     TPP-MLIR   Delta
FP32    172.8   165.3      96%
BF16    131.9   131.5      100%
MLP32   172.2   164.8      96%
MLP16   131.5   131.4      100%
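
(Delta is TPP-MLIR relative to DNN: e.g. on ZEN3 FP32, 113.0 / 105.2 ≈ 1.07 → 107%, and on CLX FP32, 165.3 / 172.8 ≈ 0.96 → 96%.)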

BF16 on SPR: [attached performance chart]

rengolin commented Feb 29 '24 22:02

Some ideas:

  • Libxsmm-dnn uses mmap to allocate temporary buffers on 2M page boundaries, while the LLVM JITter probably doesn't (see the sketch after this list).
  • This could explain the 4% slowdown on ICX, but not the 7% speedup on Zen3.
  • Maybe Zen3 doesn't work well with that practice?
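
A minimal sketch of what "2M page boundary" placement means in practice, assuming a plain Linux mmap path; this is not libxsmm's actual allocator, and `alloc_2m_aligned` is a hypothetical helper:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper (not libxsmm's real allocator): over-allocate
 * with mmap, then bump the pointer to the next 2 MiB boundary. The
 * head/tail slack stays mapped to keep the sketch short. */
static void *alloc_2m_aligned(size_t size) {
  const size_t align = 2u * 1024u * 1024u;
  void *raw = mmap(NULL, size + align, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (raw == MAP_FAILED)
    return NULL;
  uintptr_t p = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
  /* Ask the kernel to back this range with transparent huge pages. */
  madvise((void *)p, size, MADV_HUGEPAGE);
  return (void *)p;
}
```

If the JITted code's temporaries come from plain malloc instead, page placement and TLB behavior differ, which could plausibly cut either way across microarchitectures.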

rengolin commented Feb 29 '24 23:02

#895 is the same problem; let's merge the issues.

I have debugged it to the following extent:

  • It's a single-thread-only issue, and only for very small problem sizes.
  • When changing libxsmm-dnn to use malloc instead of libxsmm_aligned_malloc, the gap gets much smaller (libxsmm-dnn performance drops); see the sketch after this list.
  • For larger sizes, e.g. C=K=2048 or Minibatch=1024, single-thread performance of libxsmm-dnn and tpp-mlir is identical.
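
A sketch of that allocator-swap experiment; the `SCRATCH_*` macros and `USE_PLAIN_MALLOC` toggle are hypothetical, not libxsmm-dnn's actual code, and only illustrate routing the temporary-buffer allocations through plain malloc:

```c
#include <stdlib.h>
#include <libxsmm.h>

/* Hypothetical toggle: send libxsmm-dnn's temporary buffers through
 * plain malloc instead of the 2 MiB-aligned path, then re-measure. */
#ifdef USE_PLAIN_MALLOC
# define SCRATCH_ALLOC(size) malloc(size)
# define SCRATCH_FREE(ptr)   free(ptr)
#else
# define SCRATCH_ALLOC(size) libxsmm_aligned_malloc((size), 2u * 1024u * 1024u)
# define SCRATCH_FREE(ptr)   libxsmm_free(ptr)
#endif
```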

--> "solution" let's run benchmarks on some slightly larger problem sizes, where data is large than 2M pages etc.

alheinecke commented Mar 01 '24 17:03

We can also try the alignment attribute on memref.alloc. It doesn't hurt.
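
At the MLIR level that would look something like `memref.alloc() {alignment = 2097152 : i64}`. A rough C equivalent of the intended effect after lowering; the 2 MiB alignment and 1 MiB buffer size are assumptions carried over from the discussion above, not values from the code:

```c
#include <stdlib.h>

int main(void) {
  void *buf = NULL;
  /* Request a 1 MiB temporary buffer starting on a 2 MiB boundary.
   * posix_memalign needs a power-of-two alignment that is also a
   * multiple of sizeof(void *); 2 MiB satisfies both. */
  if (posix_memalign(&buf, 2u * 1024u * 1024u, 1u * 1024u * 1024u) != 0)
    return 1;
  /* ... hand buf to the JIT-compiled kernel as scratch space ... */
  free(buf);
  return 0;
}
```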

rengolin commented Mar 01 '24 17:03