
`cupyx.scipy.sparse.linalg.splu` solve is slower than `scipy` equivalent

Open jrbourbeau opened this issue 1 year ago • 7 comments

Description

While using cupyx.scipy.sparse.linalg.splu, I noticed that subsequent solve() steps using the SuperLU result returned from cupyx.scipy.sparse.linalg.splu are much slower (~120x in the example below) than the equivalent scipy operation on the CPU.

To Reproduce

Here's the scipy version (note I'm using the ipython/jupyter %time magic for timing the linalg.splu and .solve lines):

import numpy as np
from scipy.sparse import random, csc_matrix
from scipy.sparse import linalg

size = 10_000
A = random(size, size, density=0.5, random_state=2)
A = csc_matrix(A)
x = np.random.default_rng(seed=2).random(size)
%time B = linalg.splu(A)    # 1min 10s
%time result = B.solve(x)   # 81.5 ms

The output here gives a runtime of 1min 10s for the B = linalg.splu(A) line and 81.5 ms for the result = B.solve(x) line.

Switching to the equivalent cupy version:

import cupy as cp
from cupyx.scipy.sparse import random, csc_matrix
from cupyx.scipy.sparse import linalg

size = 10_000
A = random(size, size, density=0.5, random_state=2)
A = csc_matrix(A)
x = cp.random.default_rng(seed=2).random(size)
%time B = linalg.splu(A)    # 1min 15s
%time result = B.solve(x)   # 9.86 s

In this case, the B = linalg.splu(A) line takes ~1min 15s, which is similar to the scipy CPU version. This is expected, as the docs https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.splu.html note that this step is actually performed on the CPU with scipy anyway. What's surprising is that the result = B.solve(x) line, which runs on the GPU, takes 9.86 s, which is much slower (~120x) than the equivalent scipy step.

Installation

Conda-Forge (conda install ...)

Environment

OS                           : Linux-5.15.0-1068-aws-x86_64-with-glibc2.36
Python Version               : 3.10.14
CuPy Version                 : 13.2.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.26.4
SciPy Version                : 1.14.1
Cython Build Version         : 0.29.37
Cython Runtime Version       : None
CUDA Root                    : None
nvcc PATH                    : None
CUDA Build Version           : 12050
CUDA Driver Version          : 12060
CUDA Runtime Version         : 12050 (linked to CuPy) / RuntimeError('CuPy failed to load libcudart.so.12: OSError: libcudart.so.12: cannot open shared object file: No such file or directory') (locally installed)
cuBLAS Version               : (available)
cuFFT Version                : 11206
cuRAND Version               : 10307
cuSOLVER Version             : (11, 6, 4)
cuSPARSE Version             : (available)
NVRTC Version                : (12, 6)
Thrust Version               : 200400
CUB Build Version            : 200200
Jitify Build Version         : <unknown>
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA A10G
Device 0 Compute Capability  : 86
Device 0 PCI Bus ID          : 0000:00:1E.0

Additional Information

No response

jrbourbeau avatar Aug 29 '24 17:08 jrbourbeau

Interesting. I ran on a V100/Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz and saw similar results:

Wall time: 2min 2s (np linalg.splu)
Wall time: 198 ms (np B.solve)

Wall time: 2min 23s (cp linalg.splu)
Wall time: 10.3 s (cp B.solve)

quasiben avatar Aug 29 '24 21:08 quasiben

For the CuPy benchmark, could you try measuring with from cupyx.profiler import benchmark? https://docs.cupy.dev/en/stable/reference/generated/cupyx.profiler.benchmark.html
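
For example, something along these lines should work for the solve step (reusing B and x from the reproducer above; n_repeat is arbitrary):

from cupyx.profiler import benchmark

# benchmark() synchronizes the device and reports both CPU and GPU times,
# so the measurement is not skewed by CuPy's asynchronous kernel launches
print(benchmark(B.solve, (x,), n_repeat=10))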

kmaehashi avatar Aug 30 '24 05:08 kmaehashi

I've done some nsys profiling, which is attached below:

[nsys profile screenshot: sparse_solve]

The image above filters for CUDA API calls from a single B.solve(x) call. As can be seen, the time is dominated by two cudaMemcpyAsync calls that together take over 8.355 seconds, while the total measured runtime for that particular iteration on my system was 8.3562 seconds. Unfortunately I'm not very familiar with linalg.splu.solve, but hopefully someone experienced can provide some input on whether this can be mitigated somehow, as it seems we're limited almost entirely by those two copies.
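
For anyone who wants to reproduce a similar trace, something along these lines should work (the range name and script name are arbitrary):

from cupyx.profiler import time_range

# mark the solve call with an NVTX range so it is easy to locate in the nsys timeline;
# sync=True synchronizes the device around the range so it brackets the full GPU work
with time_range('splu_solve', sync=True):
    result = B.solve(x)

# then run the script under nsys, e.g.: nsys profile -o sparse_solve python repro.py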

pentschev avatar Aug 30 '24 16:08 pentschev

To directly respond to @kmaehashi's ask, below is the benchmark for splu:

In [3]: benchmark(linalg.splu, (A,), n_repeat=10)
Out[3]: splu:
    CPU:   123319562.751 us +/- 6660661.268 (min: 117148949.450 / max: 140372162.229) us
    GPU-0: 123320463.281 us +/- 6660711.261 (min: 117149812.500 / max: 140373187.500) us

quasiben avatar Aug 30 '24 16:08 quasiben

Q for @pentschev: Didn't @jrbourbeau already figure out the source of the data roundtrip in #8570? Or is it a different thing?

leofang avatar Sep 16 '24 04:09 leofang

@leofang it is unclear whether they are related or not; the cudaMemcpyAsync calls happen during the B.solve(x) call, where B = linalg.splu(A). Will solve() also be executed on the CPU? If that is expected, then it is indeed related to #8570, but the cudaMemcpyAsync calls still cause the process to be ~120x slower than operating directly on CPU memory, which will probably feel like a bug to users. Do you have any ideas why cudaMemcpyAsync would take >4 s each? Is it because of the memory copy pattern for sparse matrices?

pentschev avatar Sep 16 '24 07:09 pentschev

My understanding (which could be wrong) is that these are separate issues. https://github.com/cupy/cupy/issues/8570 is about the GPU --> CPU copy here

https://github.com/cupy/cupy/blob/3e2b8ed75ea8f8cd86258b44d506d62f35ccb1d9/cupyx/scipy/sparse/linalg/_solve.py#L705

which happens prior to constructing the SuperLU object returned by linalg.splu. The slowdown reported here is related to subsequent SuperLU.solve(...) calls that happen later.
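
As a quick sanity check (sketch), the one-off device-to-host transfer of A, which is roughly what that line does, can be timed separately from the repeated solve cost:

%time A_host = A.get()      # GPU -> CPU copy of the CSC matrix, roughly the #8570 path
%time result = B.solve(x)   # the repeated-solve slowdown reported in this issue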

jrbourbeau avatar Sep 16 '24 15:09 jrbourbeau

I did some digging into the problem. I found that the LU factors that come out of SuperLU may not be sparse. In fact, when I tested locally with a random sparse matrix A, the factors had about 48% density, which is barely sparse considering that they are both triangular matrices by definition. I do not know how the random nature of A affects the sparsity of the factors; in a problem closer to reality, they might actually be sparse.

When the factors are not sparse, the cuSPARSE SpSM call that runs in the background to solve the respective triangular systems during .solve() becomes very slow. It has several steps, and the spsm_analysis step takes the most time. By comparison, solving the triangular systems with the corresponding dense cuBLAS trsm routine brings the .solve() call down to about 150 ms (still slower than the CPU version, but much faster than the cuSPARSE path). Feel free to verify this.
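
If anyone wants to check the factor density on their own matrices, here is a rough sketch. It uses SciPy for the factorization (which is what the CuPy splu calls into anyway) and ignores the row/column permutations in the dense comparison, so it only gauges the trsm cost rather than reproducing the full solve:

import cupy as cp
from cupyx.scipy.linalg import solve_triangular
from scipy.sparse.linalg import splu as cpu_splu

lu = cpu_splu(A.get())                 # A.get() copies the CuPy CSC matrix back to the host
n = A.shape[0]
print("L density:", lu.L.nnz / (n * n))
print("U density:", lu.U.nnz / (n * n))

# rough cost of a single dense triangular solve on the GPU (permutations ignored,
# and note that densifying L needs ~n*n*8 bytes of device memory)
L_dense = cp.asarray(lu.L.toarray())
y = solve_triangular(L_dense, x, lower=True)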

aloumakos avatar Oct 28 '24 20:10 aloumakos

We are seeing that even for extremely sparse (density ~1e-7) matrices, repeated solves with the cupy SuperLU factorization spend a lot of time in spsm_analysis.

According to the NVIDIA help forums, it seems like the cusparseSpSV_analysis step can be precomputed and reused when solving against multiple right-hand sides. It would be great for the cupy SuperLU wrappers to do this, as it could provide a substantial speedup.
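
In the meantime, a possible workaround (sketch), assuming the wrapper accepts a 2-D right-hand side of shape (n, nrhs) like its scipy counterpart, is to batch the right-hand sides so the analysis cost should be paid once per call rather than once per solve:

import cupy as cp

rhs = cp.stack([x, 2.0 * x], axis=1)   # hypothetical batch of right-hand sides, shape (n, 2)
results = B.solve(rhs)                 # a single call should run the analysis once for all columns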

kburns avatar Jul 25 '25 16:07 kburns