Multiplication with a symmetrically wrapped CuSparseMatrix returns wrong answer
Describe the bug
Multiplication with a symmetrically wrapped CuSparseMatrix returns the product with the underlying CuSparseMatrix itself, not with the symmetric matrix the wrapper represents.
To reproduce
The Minimal Working Example (MWE) for this bug:
```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

Id = [1, 2, 3, 4, 1, 4, 5, 2, 4, 6, 7, 3, 4, 8, 4, 5, 8, 9, 4, 5, 10]
Jd = [1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10]
V = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0]
T = sparse(Id, Jd, V)
v = ones(10)
Symmetric(T)*v - T*v # Non-zero
# However, the equivalent on the GPU does not work the same way
Tgpu = CuSparseMatrixCSR(T)
vgpu = CuVector(v)
Symmetric(Tgpu)*vgpu - Tgpu*vgpu # 0
```
Expected behavior
I expected the behaviour to match that of the CPU version.
Version info
Details on Julia:
Julia Version 1.11.0
Commit 501a4f25c2b (2024-10-07 11:40 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)
Environment:
JULIA_EXCLUSIVE = 0
LD_LIBRARY_PATH = /lsf/10.1/linux3.10-glibc2.17-x86_64/lib
JULIA_NUM_THREADS = 1
Details on CUDA:
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 565.57.1
CUDA libraries:
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+565.57.1
Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0
Toolchain:
- Julia: 1.11.0
- LLVM: 16.0.6
1 device:
0: Tesla V100-PCIE-16GB (sm_70, 8.250 GiB / 16.000 GiB available)
We don't have GPU routines specialized for symmetric matrices. The sparse matrix is just unwrapped from Symmetric or Hermitian.
Yeah, I followed the multiplication and found that a call to _unwrap(A) is what causes this. I might be wrong, but I do not think the current behaviour is user-friendly.
I've looked a little into the cuSPARSE documentation, and it does seem like a symmetric type is available, even exposed in CUSPARSE as CUSPARSE.CUSPARSE_MATRIX_TYPE_SYMMETRIC. The documentation does note that exploiting the symmetry in mat-vecs is not very efficient, but when the mat-vec is only used for iterative refinement after solving a linear system with cuDSS (where only a single triangle of the matrix is required), the overhead is probably ok/negligible?
I agree that it's not optimal but it is not supported by any GPU library (to the best of my knowledge).
I never wanted to add methods with Symmetric because of that for CUSPARSE, but some users wanted this behavior instead of an error.
Also, Symmetric is not officially a specific storage format; in many cases it is only used for dispatch. If we store both triangles, we still need to decide between uplo=:L and uplo=:U.
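For reference, the two uplo choices read different triangles of the stored data, so they disagree whenever the parent matrix is not itself symmetric; a small CPU sketch:

```julia
using SparseArrays, LinearAlgebra

A = sparse([1.0 2.0; 3.0 4.0])
Symmetric(A, :U)  # reads the upper triangle: [1.0 2.0; 2.0 4.0]
Symmetric(A, :L)  # reads the lower triangle: [1.0 3.0; 3.0 4.0]
```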
I checked the type you mentioned in the CUDA documentation (https://docs.nvidia.com/cuda/cusparse/index.html?highlight=CUSPARSE_MATRIX_TYPE_SYMMETRIC#cusparsematrixtype-t); it is part of the legacy API, which we removed here a long time ago. I will investigate whether we can do that with the generic API, but I'm not optimistic, because even in the legacy API this format was not supported by many routines.
@mipals I confirm that it is not supported by the new API and it was also not supported for sparse matrix-vector and matrix-matrix products with the legacy API:
- https://docs.nvidia.com/cuda/archive/11.8.0/cusparse/index.html#csrmvEx
Ahh, I did not realize that it was only part of the legacy API, and that it was not even supported there. That's a shame. Thanks for investigating!
I guess I can do it "manually" then, since mat-vecs with both the matrix and its transpose are supported. In that case I would need to subtract one copy of the diagonal; however, it seems the standard Diagonal(Tgpu) throws ERROR: Scalar indexing is disallowed. What is the best way to get around this?
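A sketch of that manual workaround, assuming T stores only the upper triangle as in the MWE above (so Symmetric(T)*v == T*v + T'*v - diag(T).*v), with the diagonal extracted on the CPU to sidestep the scalar-indexing error:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

# T and v as in the MWE above; T is upper triangular there.
d = CuVector(Vector(diag(T)))  # extract the diagonal on the CPU
Tgpu = CuSparseMatrixCSR(T)
vgpu = CuVector(v)

# T*v covers the upper triangle, T'*v the lower; the diagonal is
# counted twice, so subtract one copy.
y = Tgpu * vgpu + Tgpu' * vgpu - d .* vgpu
```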
We never implemented a kernel for Diagonal(Tgpu) if Tgpu is a sparse matrix on GPU. We should add one.
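A minimal sketch of such a kernel for CSR storage, scanning each row's column indices for the diagonal entry; diag_kernel! and gpu_diag are hypothetical names, and the rowPtr/colVal/nzVal fields are assumed to follow CUDA.jl's CuSparseMatrixCSR layout:

```julia
using CUDA, CUDA.CUSPARSE

function diag_kernel!(d, rowPtr, colVal, nzVal)
    row = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if row <= length(d)
        # scan row `row` for its diagonal entry, if stored
        for k in rowPtr[row]:(rowPtr[row + 1] - 1)
            if colVal[k] == row
                d[row] = nzVal[k]
                break
            end
        end
    end
    return nothing
end

function gpu_diag(A::CuSparseMatrixCSR)
    n = min(size(A)...)
    d = CUDA.zeros(eltype(A), n)  # structural zeros stay zero
    threads = 256
    blocks = cld(n, threads)
    @cuda threads=threads blocks=blocks diag_kernel!(d, A.rowPtr, A.colVal, A.nzVal)
    return d
end
```

This does one binary-search-free linear scan per row, which is cheap for the short rows typical of FEM-style sparsity patterns.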