Tool to analyze singular values of matrices during execution
Super drafty and not intended for review.
- Apply this PR.
- Install Eigen3 in a way that cmake (standard FindEigen3.cmake) will find.
- `cmake -DCMAKE_C_FLAGS=-DIREE_HAL_EXECUTABLE_LIBRARY_CALL_HOOK .`
  - Needed for a hook that records the name of the current dispatch function.
- `ninja` (need to rebuild compiler, tools, and the experimental .so here).
- Compile your MLIR program with your usual `iree-compile` command line plus this extra flag: `--iree-llvmcpu-link-embedded=false`. This is needed so that unresolved symbols are not a linking error at that point, so your module can be linked with the SVD analysis hooks later at runtime.
- Run your MLIR program as usual with these additional things:
- Environment variable (assuming current working dir is IREE build dir):
LD_PRELOAD=$PWD/experimental/svd_analysis/libiree_experimental_svd_analysis_svd_analysis.so
- Flags: force single-threaded execution, because the tool is not ready for multi-threaded execution. Pass either `--task_topology_max_group_count=1` or `--device=local-sync`.
Example: to run on BERT-Large:
Get the artifacts (see https://github.com/iree-org/iree/discussions/16246).
Compile:
tools/iree-compile \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-link-embedded=false \
--iree-llvmcpu-target-cpu=znver4 \
--iree-llvmcpu-enable-ukernels=mmt4d \
~/testing/bert_large_batch1.mlirbc -o /tmp/bert_large_batch1.vmfb
Run:
LD_PRELOAD=$PWD/experimental/svd_analysis/libiree_experimental_svd_analysis_svd_analysis.so \
tools/iree-run-module \
--module=/tmp/bert_large_batch1.vmfb \
--function=forward \
"--input=1x384xi64=[[`seq 1 384`]]" \
"--input=1x384xi64=[[`seq 11 394`]]" \
--device=local-sync
Resulting log: https://gist.github.com/bjacob/34fd85f2723233281826b9a2ccad8763
Example log entry for a dispatch:
CALL: forward_dispatch_5_mmt4d_24x64x1024x16x16x1_f32
OP: mmt4d
LHS: 8x1024x16x1xf32
25.8% of normalized singular values are <= 1 and > 0.3
3.1% of normalized singular values are <= 0.3 and > 0.1
1.6% of normalized singular values are <= 0.1 and > 0.03
1.6% of normalized singular values are <= 0.003 and > 0.001
0.8% of normalized singular values are <= 0.001 and > 0.0003
0.8% of normalized singular values are <= 0.0003 and > 0.0001
1.6% of normalized singular values are <= 1e-06 and > 0
64.8% of normalized singular values are == 0.
RHS: 32x1024x16x1xf32
18.0% of normalized singular values are <= 1 and > 0.3
60.2% of normalized singular values are <= 0.3 and > 0.1
21.9% of normalized singular values are <= 0.1 and > 0.03
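To compare many dispatches at once, a log in this format can be summarized with a short script. The following is a minimal sketch, not part of this PR: the function names `parse_log` and `zero_fraction` are hypothetical, and it assumes every entry follows the `CALL:` / `LHS:` / `RHS:` layout shown above.

```python
import re
from collections import defaultdict

# Matches histogram lines such as
#   "25.8% of normalized singular values are <= 1 and > 0.3"
#   "64.8% of normalized singular values are == 0."
BUCKET_RE = re.compile(
    r"([\d.]+)% of normalized singular values are "
    r"(?:<= (\S+) and > (\S+)|== 0)"
)

def parse_log(text):
    """Return {call_name: {operand: [(percent, upper, lower), ...]}}."""
    result = defaultdict(lambda: defaultdict(list))
    call = operand = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CALL:"):
            call = line.split(":", 1)[1].strip()
            operand = None
        elif line.startswith(("LHS:", "RHS:")):
            operand = line.split(":", 1)[0]
        else:
            m = BUCKET_RE.match(line)
            if m and call and operand:
                percent = float(m.group(1))
                if m.group(2) is None:
                    # The "== 0" bucket is stored with both bounds at zero.
                    upper = lower = 0.0
                else:
                    upper, lower = float(m.group(2)), float(m.group(3))
                result[call][operand].append((percent, upper, lower))
    return result

def zero_fraction(buckets):
    """Percentage of singular values reported as exactly zero."""
    return sum(p for p, upper, _ in buckets if upper == 0.0)

# A shortened copy of the log entry above, used as sample input.
SAMPLE = """\
CALL: forward_dispatch_5_mmt4d_24x64x1024x16x16x1_f32
OP: mmt4d
LHS: 8x1024x16x1xf32
25.8% of normalized singular values are <= 1 and > 0.3
1.6% of normalized singular values are <= 1e-06 and > 0
64.8% of normalized singular values are == 0.
RHS: 32x1024x16x1xf32
18.0% of normalized singular values are <= 1 and > 0.3
60.2% of normalized singular values are <= 0.3 and > 0.1
21.9% of normalized singular values are <= 0.1 and > 0.03
"""

parsed = parse_log(SAMPLE)
for call_name, operands in parsed.items():
    for operand_name, buckets in operands.items():
        print(call_name, operand_name, zero_fraction(buckets))
```

Sorting dispatches by `zero_fraction` would be one way to find the operands with the most rank deficiency.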
Note: here, "normalized singular values" are the singular values divided by the largest singular value. The largest normalized singular value is therefore always 1 by construction, and the others decrease from there toward zero. The question is how fast they decay.
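The normalization and bucketing can be illustrated in a few lines. This is a sketch under stated assumptions, not the code in this PR: the singular values are taken as given (for example from an SVD routine), the function name `normalized_histogram` is hypothetical, and the threshold list is a shortened stand-in for whatever thresholds the tool actually uses.

```python
def normalized_histogram(singular_values, thresholds):
    """Bucket singular values after dividing by the largest one.

    Returns (upper, lower, percent) rows for half-open buckets
    (lower, upper], followed by a (0.0, 0.0, percent) row that counts
    the values that are exactly zero.
    """
    largest = max(singular_values)
    if largest <= 0:
        raise ValueError("need at least one positive singular value")
    normalized = [s / largest for s in singular_values]
    n = len(normalized)
    # Append 0 so that the last bucket is (0, smallest threshold].
    edges = list(thresholds) + [0.0]
    rows = []
    for upper, lower in zip(edges, edges[1:]):
        count = sum(1 for x in normalized if lower < x <= upper)
        rows.append((upper, lower, 100.0 * count / n))
    zeros = sum(1 for x in normalized if x == 0.0)
    rows.append((0.0, 0.0, 100.0 * zeros / n))
    return rows

# Hypothetical thresholds and toy singular values, for illustration only.
THRESHOLDS = [1.0, 0.3, 0.1, 0.03]
sigma = [4.0, 2.0, 1.0, 0.2, 0.0, 0.0, 0.0, 0.0]

rows = normalized_histogram(sigma, THRESHOLDS)
for upper, lower, percent in rows:
    if upper == 0.0:
        print(f"{percent:.1f}% of normalized singular values are == 0.")
    elif percent > 0:
        print(f"{percent:.1f}% of normalized singular values are "
              f"<= {upper:g} and > {lower:g}")
```

Empty buckets are skipped when printing, which would explain why some threshold ranges do not appear in the log entry above, although that is a guess about the tool's behavior.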
In this dispatch function, forward_dispatch_5_mmt4d_24x64x1024x16x16x1_f32, there is an mmt4d op, and about two thirds (64.8%) of its LHS's singular values are zero. This means that the LHS matrix's rank is only about a third of what a generic matrix of this shape would have, which suggests that if the computation were expressed in a more favorable basis, this LHS matrix could become up to 3x smaller. The RHS matrix, on the other hand, shows no such opportunity: all of its normalized singular values are above 0.03, so it really is a full-rank matrix.
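The "up to 3x" figure can be checked with a little arithmetic. The sketch below assumes that the 8x1024x16x1 LHS is the tiled form of a (8*16) x (1024*1) matrix, which is how mmt4d lays out its LHS (M1 x K1 x M0 x K0); the helper name `estimated_rank` is hypothetical.

```python
def estimated_rank(num_singular_values, zero_percent):
    """Number of nonzero singular values, given the percentage of zeros."""
    return round(num_singular_values * (1.0 - zero_percent / 100.0))

# LHS tile layout 8x1024x16x1, assumed to flatten to a 128 x 1024 matrix.
m, k = 8 * 16, 1024 * 1
num_sv = min(m, k)  # a matrix has min(m, k) singular values

rank = estimated_rank(num_sv, 64.8)  # 64.8% of singular values are zero
shrink = num_sv / rank               # full rank divided by observed rank

print(f"{num_sv} singular values, estimated rank {rank}, "
      f"potential shrink factor {shrink:.2f}x")
```

Under these assumptions the rank comes out at roughly 35% of the full rank, which is consistent with the estimate of a matrix up to about 3x smaller.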