
Explore performance of XLA:CPU on ARM.


@sherhut @d0k @jreiffers

It would be interesting to benchmark XLA:CPU Next on ARM. I am starting this issue to track progress and to share information about where the relevant code lives.

XLA:CPU uses MLIR tiling/fusion/vectorization transformations that exist in both the OpenXLA and TF repos.

1. The XLA:CPU compiler contains two important parts:

  • HloXlaRuntimePipeline, an MLIR pipeline that goes from HLO to Linalg + tHLO, then performs tiling/fusion and buffer allocation/optimizations, and emits structured control flow over scalars, vectors, and memrefs.

  • XlaCpuCompilationPipeline, which lowers the result of hlo-xla-runtime-pipeline to LLVM.

2. Tiling, fusion and vectorization.

CpuTilingPipeline finds fusion clusters, e.g. map(matmul(transpose)) or reduce(map); it tiles the root, fuses all producers into the tiled loop, and then vectorizes or scalarizes the loop bodies. There are many tests that fuse tHLO/Linalg ops in tests/Dialect/gml_st/cpu_tiling. This pipeline has options that affect tile sizes.

3. Vector optimizations and lowering to SCF.

LowerVectorsPass runs after bufferization. It rewrites higher-level vector ops, e.g. vector.contract and vector.multi_reduction, optimizes vector.transfer_read/write ops, and then lowers the result to SCF by unrolling the vectors.

4. Enabling the MLIR pipeline for AOT compilation.

The tf_library rule should have mlir_components set to "HloLowering"; see the sketch below.
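A minimal BUILD sketch (the target name, graph, config, and cpp_class values are hypothetical placeholders; the only relevant part is the mlir_components attribute):

```python
# BUILD: hypothetical tf_library target for AOT compilation with the MLIR pipeline.
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")

tf_library(
    name = "my_model_aot",                # hypothetical target name
    graph = "my_model.pb",                # frozen GraphDef (placeholder)
    config = "my_model.config.pbtxt",     # feed/fetch config (placeholder)
    cpp_class = "mynamespace::MyModel",   # generated C++ class (placeholder)
    mlir_components = "HloLowering",      # enables the MLIR HLO lowering pipeline
)
```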

pifon2a, Mar 03 '23

tf_library rule should have mlir_components set to "HloLowering".

Or alternatively, depend on the implicitly defined MLIR library (the target name plus an '_mlir' suffix); see the sketch below.
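A sketch of that alternative, reusing the hypothetical my_model_aot target from the earlier example; the point is only to depend on the implicitly created <name>_mlir target instead of setting mlir_components yourself:

```python
# BUILD: hypothetical consumer that links against the implicitly defined
# "<name>_mlir" variant created by tf_library.
cc_binary(
    name = "my_model_main",           # hypothetical binary name
    srcs = ["my_model_main.cc"],      # placeholder source file
    deps = [
        ":my_model_aot_mlir",         # implicit MLIR variant of :my_model_aot
    ],
)
```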

jreiffers, Mar 03 '23

FWIW, as of 3693f68ceb32cf15ed1c1d2f5b7d88890fcd6af9 I still get a >10x slowdown when running BERT from MLPerf with python run.py --backend=tf --scenario SingleStream using XLA-MLIR (TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" XLA_FLAGS="--xla_cpu_use_xla_runtime") vs. the default.

Looking at htop, the default uses all cores heavily, whereas all cores are essentially idle when using XLA. Is there some obvious flag I'm missing? Any suggested approach to narrowing down the issue?

Thanks in advance, Thomas

RoboTux, Mar 28 '23

running BERT from MLperf with python run.py --backend=tf --scenario SingleStream using XLA-MLIR

Would you mind providing a script/instructions to reproduce this? I'm guessing this issue only appears on ARM?

jon-chuang, Apr 04 '23

@RoboTux is it a 10x slowdown compared to XLA:CPU Current, or just compared to TF? XLA:CPU Next/Current are single-threaded only; that might be the problem.

pifon2a, Apr 04 '23

Hi there,

Sorry for the late reply. I was comparing default TF (no XLA) against XLA:CPU Next with auto partitioning on a 16-core Graviton 3 system (AWS c7g.4xlarge instance). Following your answer I tried a single-threaded setting, and the difference then drops to 3x.

To reproduce, I built the TF pip package locally, installed it, cloned the mlcommons/inference.git repository, and compared

TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" XLA_FLAGS="--xla_cpu_use_xla_runtime" python run.py --backend=tf --scenario SingleStream

against

TF_INTRA_OP_PARALLELISM_THREADS=1 TF_INTER_OP_PARALLELISM_THREADS=1 python run.py --backend=tf --scenario SingleStream

RoboTux, Apr 24 '23