
Problem setting matrixtype in pyculib.sparse.Sparse.matdescr

Open matthieuheitz opened this issue 6 years ago • 5 comments

I have a problem with the matrix descriptor class when passing parameters different from the defaults:

>>> import pyculib
>>> handle = pyculib.sparse.Sparse()
>>> handle
<pyculib.sparse.api.Sparse object at 0x7f2bc75dbe48>
>>> descr = handle.matdescr()
>>> descr.matrixtype
'G'
>>> descrS = handle.matdescr(matrixtype='S')
>>> descrS.matrixtype
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "~/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/sparse/binding.py", line 201, in matrixtype
    return MATRIXTYPECHAR[self._api.cusparseGetMatType(self._handle)]
  File "~/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/utils/libutils.py", line 48, in wrapped
    self.check_error(status)
  File "~/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/utils/libutils.py", line 54, in check_error
    raise self.ErrorType(status)
pyculib.sparse.binding.CuSparseError: CUSPARSE_STATUS_NOT_INITIALIZED

>>> descrT = handle.matdescr(matrixtype='T')
>>> descrT.matrixtype
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/matthieu/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/sparse/binding.py", line 201, in matrixtype
    return MATRIXTYPECHAR[self._api.cusparseGetMatType(self._handle)]
  File "/home/matthieu/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/utils/libutils.py", line 48, in wrapped
    self.check_error(status)
  File "/home/matthieu/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/utils/libutils.py", line 54, in check_error
    raise self.ErrorType(status)
pyculib.sparse.binding.CuSparseError: CUSPARSE_STATUS_INVALID_VALUE

>>> descrH = handle.matdescr(matrixtype='H')
>>> descrH.matrixtype
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/matthieu/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/sparse/binding.py", line 201, in matrixtype
    return MATRIXTYPECHAR[self._api.cusparseGetMatType(self._handle)]
  File "/home/matthieu/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/utils/libutils.py", line 48, in wrapped
    self.check_error(status)
  File "/home/matthieu/anaconda3/envs/OTML/lib/python3.6/site-packages/pyculib/utils/libutils.py", line 54, in check_error
    raise self.ErrorType(status)
pyculib.sparse.binding.CuSparseError: CUSPARSE_STATUS_ALLOC_FAILED

Any idea what could be happening? I tried to debug it, but the function that returns the error is self._api.cusparseGetMatType and I can't step inside it.

Other than that, I am also confused by the matrix descriptor documentation in CUDA. The text says:

This type indicates the type of matrix stored in sparse storage. Notice that for symmetric, Hermitian and triangular matrices only their lower or upper part is assumed to be stored.

The whole idea of matrix type and fill mode is to keep minimum storage for symmetric/Hermitian matrix, and also to take advantage of symmetric property on SpMV (Sparse Matrix Vector multiplication). To compute y=A*x when A is symmetric and only lower triangular part is stored, two steps are needed. First step is to compute y=(L+D)*x and second step is to compute y=L^T*x + y. Given the fact that the transpose operation y=L^T*x is 10x slower than non-transpose version y=L*x, the symmetric property does not show up any performance gain. It is better for the user to extend the symmetric matrix to a general matrix and apply y=A*x with matrix type CUSPARSE_MATRIX_TYPE_GENERAL.

Does that mean I should just use the general matrix type even if my matrix is symmetric? But then why say that "the whole idea of matrix type and fill mode is to keep minimum storage for symmetric/Hermitian matrix, **and also to take advantage of symmetric property on SpMV**"?

I know that this might be more of a CUDA question than a pyculib one, but maybe someone has the answer...

My config:

CUDA 8.0 Ubuntu 16.04 pyculib 1.0.2+2.g7ae9662

$ conda info
     active environment : None
                 ...
          conda version : 4.5.12
    conda-build version : 3.0.27
         python version : 3.6.8.final.0
                    ...
               platform : linux-64
             user-agent : conda/4.5.12 requests/2.18.4 CPython/3.6.8 Linux/4.4.0-141-generic ubuntu/16.04 glibc/2.23
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

matthieuheitz avatar Jan 08 '19 09:01 matthieuheitz

Thanks for the report. I think there may be a bug in the CUDA initialisation sequence logic; I can reproduce what you see.

Other than that, I am also confused by the matrix descriptor documentation in CUDA. The text says:

I think what it means is that if you want to compute y := A * x where A is symmetric but only the lower part is stored, then better performance is achieved by storing A as a general sparse matrix type. Computing y := (L+D) * x (the (lower triangle + diagonal) * x product) and then y := L^T * x + y (the transposed lower triangle * x, plus the result of the first step) is really expensive, because the transpose operation y := L^T * x is very slow.
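The two-step scheme described in the CUDA docs can be checked with a small NumPy sketch (dense here purely for illustration; cuSPARSE would apply the same idea to a sparse lower triangle, with the L^T product being the slow part on the GPU):

```python
import numpy as np

# A small symmetric matrix; under CUSPARSE_MATRIX_TYPE_SYMMETRIC with
# fill mode 'L', only the lower triangle (plus diagonal) would be stored.
A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 5.0]])
x = np.array([1.0, 2.0, 3.0])

# Split A into the strictly lower triangle L and the diagonal D.
L = np.tril(A, -1)
D = np.diag(np.diag(A))

# Step 1: y := (L + D) * x
y = (L + D) @ x
# Step 2: y := L^T * x + y   (this transpose product is the expensive step)
y += L.T @ x

# The two steps together reproduce the full symmetric product A * x.
assert np.allclose(y, A @ x)
```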

stuartarchibald avatar Feb 13 '19 09:02 stuartarchibald

Thanks for the report. I think there may be a bug in the CUDA initialisation sequence logic; I can reproduce what you see.

Okay, did you find a corresponding bug report in the CUDA project, or should we file it?

For the documentation, I guess what confuses me is this sentence:

To compute y=A*x when A is symmetric and only lower triangular part is stored

Does it mean that A is not a "real" object, and that you only have L and D as real objects in memory? Or is there a documented way to have a real object A for a symmetric matrix that only stores the lower part?

I don't understand the point of having a descriptor for symmetric/Hermitian/triangular matrices if it is actually advised to use the general type. In which case is this cusparseMatrixType_t useful?

matthieuheitz avatar Feb 13 '19 14:02 matthieuheitz

Okay, did you find a corresponding bug report in the CUDA project, or should we file it?

The bug is in this project's CUDA initialisation sequence logic, not in the CUDA driver or runtime itself.

For the documentation, I guess what confuses me is this sentence:

It's really common to use a variety of storage schemes to reduce memory use, improve locality, etc. Here are some common ones:

Dense: https://www.netlib.org/lapack/lug/node121.html
Sparse: http://netlib.org/linalg/html_templates/node90.html
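As a concrete illustration of one such sparse scheme, here is what SciPy stores for a small matrix in CSR (compressed sparse row) form, the same layout the csrmv example further down uses:

```python
import numpy as np
import scipy.sparse

# A small matrix and its CSR representation: three flat arrays instead of
# the full m x n grid of values.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 3.0],
              [4.0, 5.0, 0.0]])
csr = scipy.sparse.csr_matrix(A)

print(csr.data)     # non-zero values, row by row:      [1. 2. 3. 4. 5.]
print(csr.indices)  # column index of each stored value: [0 2 2 0 1]
print(csr.indptr)   # where each row starts in data:     [0 2 3 5]
```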

Does it mean that A is not a "real" object, and that you only have L and D as real objects in memory ? Or is there a documented way to have a real object A of a symmetric matrix but that only stores the lower part ?

Conceptually, A here is just a matrix; how it is stored, and how the multiplication with a vector x is performed, are implementation details. The cusparseMatDescr_t documentation explains how to indicate that a stored matrix has certain properties, and based on those properties CUDA will run an appropriate algorithm for the matrix-vector product.

I don't understand the point of having a descriptor for symmetric/Hermitian/triangular matrices if it is actually advised to use the general type. In which case is this cusparseMatrixType_t useful?

The only case for which the general type is advised is when you have a sparse matrix which is symmetric but, to save space, you have decided to store only the lower triangular part. With this matrix, if you want to compute the matrix-vector product, the CUDA docs advise switching the storage format to the general type (using more memory, by transposing and copying the lower triangle into the upper to form a "general" sparse matrix), as this will yield better performance.

Here is an example of performing two mat-vec products y := A*x using exactly the same sparse matrix data throughout: first a mat-vec operation telling CUDA that it is a general sparse matrix (i.e. CUDA will interpret the matrix literally as presented), and then, with no change to the original sparse matrix data, telling CUDA to interpret the sparse matrix data as a symmetric matrix whose data is stored in the upper triangle.

import numpy as np
import scipy.sparse.linalg
import pyculib

handle = pyculib.sparse.Sparse()
dtype = np.float32
m = n = 3
trans = 'N'

# Initialize the CSR matrix on the host and GPU.
row = np.array([0, 0, 0, 1, 2])
col = np.array([0, 1, 2, 1, 2])
data = np.array([0.431663, 0.955176, 0.925239, 0.569277, 0.48015], dtype=dtype)

csrMatrixCpu = scipy.sparse.csr_matrix((data, (row, col)), shape=(m, n))
csrMatrixGpu = pyculib.sparse.csr_matrix((data, (row, col)), shape=(m, n))

x = np.ones((n,), dtype=dtype)

print("Upper triangular CPU matrix:")
print(csrMatrixCpu)
print(csrMatrixCpu.todense())

# grab the properties
nnz = csrMatrixGpu.nnz
csrVal = csrMatrixGpu.data
csrRowPtr = csrMatrixGpu.indptr
csrColInd = csrMatrixGpu.indices


alpha = 1.0
beta = 0.

# get cpu result
cpuResult = csrMatrixCpu.dot(x)

# get gpu result
descr = handle.matdescr(0, 'N', 'U', 'G') # NOTE THE 'G'
gpuResult = np.zeros(m, dtype=dtype)
handle.csrmv(trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, gpuResult)

print("Results of y:=A*x")
print('gpu result = ' + str(gpuResult))
print('cpu result = ' + str(cpuResult))

# now tell CUDA that our matrix, which is upper triangular, should be interpreted
# as being symmetric, the matrix data doesn't change, but the algorithm used by
# CUDA internally does.
descr = handle.matdescr(0, 'N', 'U', 'S') # NOTE THE 'S'
gpuResult = np.zeros(m, dtype=dtype)
handle.csrmv(trans, m, n, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd, x, beta, gpuResult)


# create a new CPU matrix that is symmetric to check CUDA result
csrMatrixCpuSymmetric = scipy.sparse.csr_matrix((data, (row, col)), shape=(m, n))
csrMatrixCpuSymmetric += scipy.sparse.tril(csrMatrixCpuSymmetric.T, -1)
cpuResult = csrMatrixCpuSymmetric.dot(x)

print("\n\nSymmetric CPU matrix:")
print(csrMatrixCpuSymmetric.todense())
print(csrMatrixCpuSymmetric)

print("Results of y:=A*x:")
print('gpu result = ' + str(gpuResult))
print('cpu result = ' + str(cpuResult))

stuartarchibald avatar Feb 14 '19 11:02 stuartarchibald

Oh okay thanks, now I understand better :+1: I didn't know that specific storage existed for non-sparse symmetric matrices, as mentioned here https://www.netlib.org/lapack/lug/node121.html; I hadn't seen an implementation of it before. Your example is great for showing how to store only one half of a symmetric matrix when it's sparse. If I understand correctly:

  • Performing a mat-vec product with a sparse symmetric matrix, storing only half of it, is possible in pyculib but not in scipy.
  • Performing a mat-vec product with a non-sparse symmetric matrix, storing only half of it, is not possible in pyculib or scipy, unless you use a sparse container for your non-sparse matrix and come back to the previous case.

I think what confused me in the doc was:

It is better for the user to extend the symmetric matrix to a general matrix

I didn't understand that "symmetric matrix" here means a matrix for which you store only half of the non-zeros, and "general matrix" means one for which you store all the non-zero elements. In both cases, your matrix is still mathematically symmetric. I was confused about whether "symmetric" and "general" referred to the mathematical object or to the way it is stored.

Anyway, thank you very much for these clarifications! :)

matthieuheitz avatar Feb 18 '19 09:02 matthieuheitz

Oh okay thanks, now I understand better +1

Excellent, glad this has helped.

I didn't know that there existed specific storage for non-sparse symmetric matrices, as mentioned here https://www.netlib.org/lapack/lug/node121.html, since I haven't seen an implementation of it yet.

It is unlikely you'll hit them/see them unless you work at a much lower level (C/Fortran), but it's something of which to be aware.

If I understand correctly:

  • Performing a mat-vec product with a sparse symmetric matrix, storing only half of it, is possible in pyculib but not in scipy.

I believe that's the case. PyCulib exposes the sparse BLAS routines in a relatively raw fashion, and its matrix layout representation is necessarily raw/complicated in comparison to, e.g., a SciPy CSR matrix.

  • Performing a mat-vec product with a non-sparse symmetric matrix, storing only half of it, is not possible in pyculib or scipy, unless you use a sparse container for your non-sparse matrix and come back to the previous case.

If you use a packed matrix format then the BLAS {s,d}spmv routine permits a symmetric matrix-vector product using just the upper or lower triangle in packed storage (use {c,z}hpmv for Hermitian): https://pyculib.readthedocs.io/en/latest/cublas.html#pyculib.blas.Blas.spmv. This has the minimal storage requirement for a symmetric/Hermitian dense matrix system, and it also uses an optimised routine for performing mat-vec on such a system, as the algorithm makes use of the fact that it knows the matrix is symmetric. I don't think SciPy has this exposed other than in the raw Cython bindings (scipy.linalg.cython_blas).
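As an aside, newer SciPy releases appear to expose a packed-storage wrapper as scipy.linalg.blas.dspmv; assuming that wrapper and its documented argument order are available, a CPU-side sketch of the packed symmetric mat-vec looks like this:

```python
import numpy as np
from scipy.linalg.blas import dspmv  # assumed available in newer SciPy

# Symmetric matrix; only the upper triangle is handed to the routine.
A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 5.0]])
n = A.shape[0]
x = np.array([1.0, 2.0, 3.0])

# Pack the upper triangle column by column (BLAS 'U' packed layout):
# A[0,0], A[0,1], A[1,1], A[0,2], A[1,2], A[2,2]
ap = np.concatenate([A[:j + 1, j] for j in range(n)])

# y := alpha * A * x, with A supplied only in packed upper-triangular form.
y = dspmv(n, 1.0, ap, x)
assert np.allclose(y, A @ x)
```

The packed array ap holds n*(n+1)/2 entries instead of n*n, which is exactly the minimal-storage property discussed above.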

I didn't understand that "symmetric matrix" here means a matrix for which you store only half of the non-zeros, and "general matrix" means one for which you store all the non-zero elements. In both cases, your matrix is still mathematically symmetric. I was confused about whether "symmetric" and "general" referred to the mathematical object or to the way it is stored.

It can certainly be confusing at times. In general you need to consider the nature of the mathematical operation, the data storage layout, and what an algorithm performing the operation expects of the stored data. This is in part why the interfaces of linear algebra libraries are so large; the combinations expand quite rapidly!

Anyway, thank you very much for these clarifications! :)

No problem at all.

stuartarchibald avatar Feb 18 '19 10:02 stuartarchibald