
Tool to analyze singular values of matrices during execution

Open bjacob opened this issue 1 year ago • 0 comments

Super drafty and not intended for review.

  1. Apply this PR.
  2. Install Eigen3 in a way that cmake (via the standard FindEigen3.cmake) will find it.
  3. cmake -DCMAKE_C_FLAGS=-DIREE_HAL_EXECUTABLE_LIBRARY_CALL_HOOK .
    • Needed for a hook that records the name of the current dispatch function.
  4. ninja (need to rebuild compiler, tools, and the experimental .so here).
  5. Compile your MLIR program with your usual iree-compile command line plus this extra flag:
    • --iree-llvmcpu-link-embedded=false. This is needed so that unresolved symbols are not a link error at compile time; the module can then be linked against the SVD analysis hooks later, at runtime.
  6. Run your MLIR program as usual with these additional things:
    • Environment variable (assuming current working dir is IREE build dir):
      • LD_PRELOAD=$PWD/experimental/svd_analysis/libiree_experimental_svd_analysis_svd_analysis.so
    • Flags: force single-threaded execution, because the tool is not ready for multi-threaded execution. Pass either --task_topology_max_group_count=1 or --device=local-sync.

Example: to run on BERT-Large:

Get the artifacts (See https://github.com/iree-org/iree/discussions/16246).

Compile:

tools/iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-link-embedded=false \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  ~/testing/bert_large_batch1.mlirbc -o /tmp/bert_large_batch1.vmfb

Run:

LD_PRELOAD=$PWD/experimental/svd_analysis/libiree_experimental_svd_analysis_svd_analysis.so \
tools/iree-run-module \
  --module=/tmp/bert_large_batch1.vmfb \
  --function=forward \
  "--input=1x384xi64=[[`seq 1 384`]]" \
  "--input=1x384xi64=[[`seq 11 394`]]" \
  --device=local-sync

Resulting log: https://gist.github.com/bjacob/34fd85f2723233281826b9a2ccad8763

Example log entry for a dispatch:

  CALL: forward_dispatch_5_mmt4d_24x64x1024x16x16x1_f32
    OP: mmt4d
      LHS: 8x1024x16x1xf32
         25.8% of normalized singular values are <=      1 and >    0.3
          3.1% of normalized singular values are <=    0.3 and >    0.1
          1.6% of normalized singular values are <=    0.1 and >   0.03
          1.6% of normalized singular values are <=  0.003 and >  0.001
          0.8% of normalized singular values are <=  0.001 and > 0.0003
          0.8% of normalized singular values are <= 0.0003 and > 0.0001
          1.6% of normalized singular values are <=  1e-06 and >      0
         64.8% of normalized singular values are == 0.
      RHS: 32x1024x16x1xf32
         18.0% of normalized singular values are <=      1 and >    0.3
         60.2% of normalized singular values are <=    0.3 and >    0.1
         21.9% of normalized singular values are <=    0.1 and >   0.03
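For long logs it can help to pull out just the fraction of exactly-zero singular values per operand. The following is a minimal sketch, not part of the PR: the `parse_zero_fractions` helper is hypothetical, and it assumes the log format is exactly as in the entry above (a `CALL:` line, then `LHS:`/`RHS:` operand lines, then bucket lines, with the `== 0.` line present only when some singular values are zero).

```python
# Hypothetical helper: summarize an SVD analysis log by zero-fraction per operand.
# Assumes the log layout shown in the example entry; other operand labels
# than LHS/RHS are not handled.

SAMPLE_LOG = """\
  CALL: forward_dispatch_5_mmt4d_24x64x1024x16x16x1_f32
    OP: mmt4d
      LHS: 8x1024x16x1xf32
         25.8% of normalized singular values are <=      1 and >    0.3
         64.8% of normalized singular values are == 0.
      RHS: 32x1024x16x1xf32
         18.0% of normalized singular values are <=      1 and >    0.3
         60.2% of normalized singular values are <=    0.3 and >    0.1
         21.9% of normalized singular values are <=    0.1 and >   0.03
"""


def parse_zero_fractions(text):
    """Return {dispatch_name: {operand: percent_of_zero_singular_values}}."""
    result = {}
    call = None
    operand = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("CALL:"):
            call = stripped.split(":", 1)[1].strip()
            result[call] = {}
            operand = None
        elif call is not None and stripped.startswith(("LHS:", "RHS:")):
            operand = stripped.split(":", 1)[0]
            # Default to 0.0: the "== 0." line is absent when nothing is zero.
            result[call][operand] = 0.0
        elif operand is not None and stripped.endswith("== 0."):
            result[call][operand] = float(stripped.split("%", 1)[0])
    return result


zero_fractions = parse_zero_fractions(SAMPLE_LOG)
for name, operands in zero_fractions.items():
    print(name, operands)
```

Operands with a large zero fraction are the candidates for the kind of rank-reduction opportunity discussed below.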

Note: here we call "normalized singular values" the singular values divided by the largest singular value. So, the largest normalized singular value is always 1 by construction, and the others decrease from there to zero. The question is how fast.
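As a rough illustration of the normalization and bucketing described in this note, here is a small sketch in plain Python. It is not the tool's implementation (the tool uses Eigen). The bucket edges are inferred from the log above, and the edges that do not appear there (0.01, 3e-05, 1e-05, 3e-06) are assumptions; the input singular values are made up.

```python
# Bucket edges inferred from the example log; edges not visible in the log
# (0.01, 3e-05, 1e-05, 3e-06) are assumptions made for this sketch.
EDGES = [1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001,
         3e-05, 1e-05, 3e-06, 1e-06, 0.0]


def normalize(singular_values):
    """Divide each singular value by the largest one."""
    largest = max(singular_values)
    if largest == 0:
        return [0.0 for _ in singular_values]
    return [s / largest for s in singular_values]


def histogram(singular_values):
    """Return {(hi, lo): percent} for non-empty buckets, plus "zero": percent."""
    normalized = normalize(singular_values)
    n = len(normalized)
    result = {}
    for hi, lo in zip(EDGES, EDGES[1:]):
        count = sum(1 for s in normalized if lo < s <= hi)
        if count:
            result[(hi, lo)] = 100.0 * count / n
    zeros = sum(1 for s in normalized if s == 0)
    if zeros:
        result["zero"] = 100.0 * zeros / n
    return result


# Made-up singular values of a rank-2 matrix that has 4 singular values.
hist = histogram([8.0, 2.0, 0.0, 0.0])
for bucket, percent in hist.items():
    print(bucket, f"{percent:.1f}%")
```

The largest value always lands in the top bucket because it normalizes to exactly 1, which matches the "by construction" remark above.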

In this dispatch function, forward_dispatch_5_mmt4d_24x64x1024x16x16x1_f32, there is an mmt4d op, and about two thirds (64.8%) of its LHS's singular values are zero. The rank of this LHS matrix is therefore only about a third of what a generic matrix of this shape would have, which suggests that if the computation were expressed in a more favorable basis, this LHS matrix could become up to 3x smaller. The RHS matrix, on the other hand, shows no such opportunity: none of its normalized singular values falls below 0.03, so it really is a full-rank matrix.
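The "up to 3x" figure can be checked with a little arithmetic. This sketch assumes the tiled mmt4d LHS shape 8x1024x16x1 is laid out as M1 x K1 x M0 x K0, so that it corresponds to an (8*16) x (1024*1) = 128 x 1024 matrix with 128 singular values; that layout is an assumption here, not something the log states. The helper name is hypothetical.

```python
def untiled_matrix_shape(tiled_shape):
    """Map a tiled operand shape (outer, K1, inner, K0) to (rows, cols).

    Assumes the mmt4d layout M1 x K1 x M0 x K0 for the LHS.
    """
    outer, k1, inner, k0 = tiled_shape
    return (outer * inner, k1 * k0)


rows, cols = untiled_matrix_shape((8, 1024, 16, 1))
num_singular_values = min(rows, cols)

# 64.8% of the LHS singular values are reported as exactly zero in the log.
zero_count = round(0.648 * num_singular_values)
numerical_rank = num_singular_values - zero_count
reduction = num_singular_values / numerical_rank

print(f"matrix: {rows}x{cols}, singular values: {num_singular_values}")
print(f"zero singular values: {zero_count}, rank: {numerical_rank}")
print(f"potential size reduction: {reduction:.2f}x")
```

Under that assumption the percentages in the log correspond to whole counts out of 128, and the ratio of full rank to observed rank comes out a little under 3.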

bjacob avatar May 07 '24 03:05 bjacob