linear_operator Fast KroneckerProduct.matmul, t_matmul and rmatmul

Hello linear_operator developers,

I have developed a library, FastKron (https://github.com/abhijangda/fastkron), for fast Matrix Kronecker-Matrix and Kronecker-Matrix Matrix multiplication. FastKron is up to 21x faster than the current algorithm (speedups range from 0.9x to 21x) on both x86 CPUs and NVIDIA GPUs. The Python module, PyFastKron, provides a PyTorch interface with a backward pass. You can find more information at https://github.com/abhijangda/fastkron .
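For readers less familiar with the operation, here is a minimal pure-Python sketch of a Kronecker-Matrix Matrix product, (A ⊗ B) X, computed without materializing A ⊗ B. It is only an illustration of the identity involved: the helper names `kron`, `matmul`, and `kron_matmul` are hypothetical, and this is neither FastKron's kernel nor the existing linear_operator code.

```python
def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ip * b_jq for a_ip in a_row for b_jq in b_row]
        for a_row in a
        for b_row in b
    ]


def matmul(m, x):
    """Plain dense matrix product m @ x on lists of rows."""
    return [
        [sum(m_ik * x_kc for m_ik, x_kc in zip(row, col)) for col in zip(*x)]
        for row in m
    ]


def kron_matmul(a, b, x):
    """Compute (a ⊗ b) @ x factor by factor, never forming a ⊗ b.

    x must have len(a[0]) * len(b[0]) rows.
    """
    rows_b, cols_b = len(b), len(b[0])
    cols_a = len(a[0])
    n_out_cols = len(x[0])
    # Step 1: split x into cols_a row-blocks of cols_b rows and apply b to each.
    blocks = [matmul(b, x[p * cols_b:(p + 1) * cols_b]) for p in range(cols_a)]
    # Step 2: mix the blocks using the entries of a as weights.
    out = []
    for a_row in a:
        for j in range(rows_b):
            out.append([
                sum(a_ip * blocks[p][j][c] for p, a_ip in enumerate(a_row))
                for c in range(n_out_cols)
            ])
    return out


# Small check against the dense computation (hypothetical example values).
a = [[1, 2], [3, 4]]
b = [[0, 1, 2], [3, 4, 5]]
x = [[1, 0], [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
assert kron_matmul(a, b, x) == matmul(kron(a, b), x)
```

The factored version touches only the small factors and the input, which is why avoiding the dense Kronecker matrix matters as the number of factors grows.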

This PR integrates FastKron into KroneckerProductLinearOperator._matmul, KroneckerProductLinearOperator._t_matmul, and KroneckerProductLinearOperator.rmatmul. Looking forward to your reviews, and happy to make any changes.

Thank You

Dec 08 '24 01:12 abhijangda

It looks like the CI uses Python 3.8. PyFastKron is built for Python >= 3.9 because PyTorch requires >= 3.9. I can build PyFastKron for 3.8, but I think the ideal fix would be to upgrade Python in CI to >= 3.9. Let me know what you prefer.

Dec 08 '24 18:12 abhijangda

We should just upgrade to 3.9+ as py3.8 is EOL anyway.

cc @jandylin, @SebastianAment re the Kronecker library

Dec 08 '24 23:12 Balandat

Thanks for upgrading the Python version to 3.10. It looks like a workflow approval is needed to run the CI tests. It would be great if you could approve it. I am happy to answer any questions about FastKron.

Feb 03 '25 00:02 abhijangda

> FastKron performs orders of magnitude faster (0.9x to 21x) than current algorithm on both x86 CPUs and NVIDIA GPUs

Can you share the benchmarks that you ran for this?

Feb 04 '25 14:02 Balandat

Install pyfastkron using pip:

pip install -U pyfastkron

Clone the repository including submodules:

git clone --recurse-submodules https://github.com/abhijangda/fastkron.git

I also recommend installing TCMalloc, either with conda install conda-forge::gperftools or, on Ubuntu, with sudo apt install google-perftools libgoogle-perftools-dev. TCMalloc is significantly faster than the default glibc malloc that Python uses. The choice of allocator does not matter for GPU performance, but on CPU TCMalloc removes the bottleneck from Python's GC. The LD_PRELOAD path depends on how you installed TCMalloc:

For conda installation: LD_PRELOAD=<anaconda-env-path>/lib/libtcmalloc.so

For apt installation: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4

These instructions are also available in the run_benchmarks.py script.
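If it helps, the path selection can be scripted. The following is a rough sketch, not part of FastKron or run_benchmarks.py: the function names `find_tcmalloc` and `benchmark_env` are hypothetical, and the two candidate paths are simply the conda and apt locations mentioned in this comment. It also sets TCMALLOC_RELEASE_RATE=0, as used in the CPU benchmark command in this thread.

```python
import os


def find_tcmalloc(conda_prefix=None):
    """Return the first existing TCMalloc library path, or None.

    conda_prefix is the path of the conda environment, if any (assumption:
    the library lives under <prefix>/lib, as in the conda instructions).
    """
    candidates = []
    if conda_prefix:
        candidates.append(os.path.join(conda_prefix, "lib", "libtcmalloc.so"))
    # Location used by the Ubuntu apt packages.
    candidates.append("/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4")
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


def benchmark_env(tcmalloc_path, base_env=None):
    """Build an environment dict for launching the CPU benchmark."""
    env = dict(os.environ if base_env is None else base_env)
    if tcmalloc_path:
        env["LD_PRELOAD"] = tcmalloc_path
        env["TCMALLOC_RELEASE_RATE"] = "0"
    return env


# Example: pick up the conda environment from CONDA_PREFIX when it is set.
env = benchmark_env(find_tcmalloc(os.environ.get("CONDA_PREFIX")))
```

The resulting dict can be passed as the `env` argument of `subprocess.run` when launching the benchmark script. If no library is found, the environment is left unchanged and the benchmark falls back to the default allocator.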

To evaluate Matrix Kronecker-Matrix (MKM) product (rmatmul in linear_operator) and Kronecker-Matrix Matrix (KMM) product (matmul in linear_operator) using Float and Double on CPU:

LD_PRELOAD=<LD-PRELOAD-PATH> TCMALLOC_RELEASE_RATE=0 python tests/benchmarks/run_benchmarks.py -backend x86 -types float double -dataset large -mmtype kmm mkm -use-pymodule

Similarly, for an NVIDIA GPU:

python tests/benchmarks/run_benchmarks.py -backend cuda -types float double -dataset large -mmtype kmm mkm -use-pymodule

The commands above use the large dataset, in which the Kronecker matrices are large, for example a Kronecker matrix of 5 factors of size 8x8 or of 2 factors of size 128x128. The large dataset takes a couple of hours on CPU and about 1 hour on GPU. The full dataset covers cases where the Kronecker factor sizes are multiples of 2 and takes a few tens of hours to run.
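To give a sense of scale for the two example configurations, here is a small arithmetic sketch. The helpers `kron_shape` and `dense_bytes` are hypothetical names used only for this illustration, and the 4-byte item size assumes float.

```python
from math import prod


def kron_shape(factor_shapes):
    """Shape of the Kronecker product of factors with the given (rows, cols)."""
    rows = prod(r for r, _ in factor_shapes)
    cols = prod(c for _, c in factor_shapes)
    return rows, cols


def dense_bytes(shape, itemsize):
    """Bytes needed to store a dense matrix of this shape."""
    return shape[0] * shape[1] * itemsize


five_8x8 = kron_shape([(8, 8)] * 5)        # (32768, 32768)
two_128x128 = kron_shape([(128, 128)] * 2)  # (16384, 16384)

print(five_8x8, dense_bytes(five_8x8, 4))        # 4 GiB as dense float
print(two_128x128, dense_bytes(two_128x128, 4))  # 1 GiB as dense float
print(5 * dense_bytes((8, 8), 4))                # 1280 bytes for the five factors
```

Storing the factors takes kilobytes while the dense Kronecker matrix takes gigabytes, so these benchmarks only make sense with algorithms that work on the factors directly.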

Feb 04 '25 18:02 abhijangda

Also, existing results for some benchmarks over GPyTorch on V100/A100 and AMD CPUs with AVX and AVX512 are here: https://github.com/abhijangda/FastKron/blob/main/documents/performance.md .

Feb 06 '25 22:02 abhijangda

Hello, I was wondering if you were able to run these benchmarks and if there are any more questions.

Mar 07 '25 01:03 abhijangda
