Multiplication with a symmetrically wrapped CuSparseMatrix returns wrong answer
Describe the bug
Multiplication with a symmetrically wrapped CuSparseMatrix returns the product with the underlying CuSparseMatrix itself, not with the symmetric matrix the wrapper represents.
To reproduce
The Minimal Working Example (MWE) for this bug:
```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

Id = [1, 2, 3, 4, 1, 4, 5, 2, 4, 6, 7, 3, 4, 8, 4, 5, 8, 9, 4, 5, 10]
Jd = [1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10]
V = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0]
T = sparse(Id, Jd, V)
v = ones(10)
Symmetric(T)*v - T*v # Non-zero
# However, the equivalent on the GPU does not work the same way
Tgpu = CuSparseMatrixCSR(T)
vgpu = CuVector(v)
Symmetric(Tgpu)*vgpu - Tgpu*vgpu # 0
```
Expected behavior
I expected the behaviour to match that of the CPU version.
Version info
Details on Julia:
Julia Version 1.11.0
Commit 501a4f25c2b (2024-10-07 11:40 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake-avx512)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)
Environment:
JULIA_EXCLUSIVE = 0
LD_LIBRARY_PATH = /lsf/10.1/linux3.10-glibc2.17-x86_64/lib
JULIA_NUM_THREADS = 1
Details on CUDA:
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 565.57.1
CUDA libraries:
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+565.57.1
Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0
Toolchain:
- Julia: 1.11.0
- LLVM: 16.0.6
1 device:
0: Tesla V100-PCIE-16GB (sm_70, 8.250 GiB / 16.000 GiB available)
We don't have GPU routines specialized for symmetric matrices. The sparse matrix is just unwrapped from Symmetric or Hermitian.
Yeah, I followed the multiplication and found that a call to _unwrap(A) is what causes this. I might be wrong, but I do not think the current behaviour is user-friendly.
I've looked a little into the cuSPARSE documentation, and it does seem like a symmetric type is available, even exposed in CUSPARSE as CUSPARSE.CUSPARSE_MATRIX_TYPE_SYMMETRIC. The documentation does note that exploiting the symmetry in mat-vecs is not very efficient, but when the mat-vec is only used for iterative refinement after solving a linear system with cuDSS (where only a single triangle of the matrix is required), the overhead is probably ok/negligible?
I agree that it's not optimal but it is not supported by any GPU library (to the best of my knowledge).
I never wanted to add methods with Symmetric because of that for CUSPARSE, but some users wanted this behavior instead of an error.
Also, Symmetric is not officially a specific storage format; in many cases it is only used for dispatch. If we store both triangles, we still need to decide between uplo=:L and uplo=:U.
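For reference, the two uplo choices read different triangles of the stored data, so they disagree whenever the parent matrix is not itself symmetric; a small CPU sketch:

```julia
using SparseArrays, LinearAlgebra

A = sparse([1.0 2.0; 3.0 4.0])
Symmetric(A, :U)  # reads the upper triangle: [1.0 2.0; 2.0 4.0]
Symmetric(A, :L)  # reads the lower triangle: [1.0 3.0; 3.0 4.0]
```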
I checked the type you mentioned in the CUDA documentation (https://docs.nvidia.com/cuda/cusparse/index.html?highlight=CUSPARSE_MATRIX_TYPE_SYMMETRIC#cusparsematrixtype-t); it is part of the legacy API, which we removed here a long time ago. I will investigate whether we can do that with the generic API, but I'm not optimistic, because even in the legacy API this format was not supported by many routines.
@mipals I confirm that it is not supported by the new API and it was also not supported for sparse matrix-vector and matrix-matrix products with the legacy API:
- https://docs.nvidia.com/cuda/archive/11.8.0/cusparse/index.html#csrmvEx
Ahh, I did not realize that it was only part of the legacy API, and that it was not even supported there. That's a shame. Thanks for investigating!
I guess I can do it "manually" then, since mat-vecs with both the matrix and its transpose are supported. In that case I would need to subtract one copy of the diagonal; however, it seems the standard Diagonal(Tgpu) throws ERROR: Scalar indexing is disallowed. What is the best way to get around this?
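A sketch of that manual workaround, assuming T stores only the upper triangle as in the MWE above (so Symmetric(T)*v == T*v + T'*v - diag(T).*v), with the diagonal extracted on the CPU to sidestep the scalar-indexing error:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

# T and v as in the MWE above; T is upper triangular there.
d = CuVector(Vector(diag(T)))  # extract the diagonal on the CPU
Tgpu = CuSparseMatrixCSR(T)
vgpu = CuVector(v)

# T*v covers the upper triangle, T'*v the lower; the diagonal is
# counted twice, so subtract one copy.
y = Tgpu * vgpu + Tgpu' * vgpu - d .* vgpu
```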
We never implemented a kernel for Diagonal(Tgpu) if Tgpu is a sparse matrix on GPU. We should add one.
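A minimal sketch of such a kernel for CSR storage, scanning each row's column indices for the diagonal entry; diag_kernel! and gpu_diag are hypothetical names, and the rowPtr/colVal/nzVal fields are assumed to follow CUDA.jl's CuSparseMatrixCSR layout:

```julia
using CUDA, CUDA.CUSPARSE

function diag_kernel!(d, rowPtr, colVal, nzVal)
    row = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if row <= length(d)
        # scan row `row` for its diagonal entry, if stored
        for k in rowPtr[row]:(rowPtr[row + 1] - 1)
            if colVal[k] == row
                d[row] = nzVal[k]
                break
            end
        end
    end
    return nothing
end

function gpu_diag(A::CuSparseMatrixCSR)
    n = min(size(A)...)
    d = CUDA.zeros(eltype(A), n)  # structural zeros stay zero
    threads = 256
    blocks = cld(n, threads)
    @cuda threads=threads blocks=blocks diag_kernel!(d, A.rowPtr, A.colVal, A.nzVal)
    return d
end
```

This does one binary-search-free linear scan per row, which is cheap for the short rows typical of FEM-style sparsity patterns.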