sparse
sparse copied to clipboard
Constructing GCXS from non-canonical `scipy.sparse.csr_matrix` results in wrong results
Describe the bug
GCXS seems to require a canonical CSR data structure, i.e. column indices sorted within a row and without duplicates. Slicing a GCXS array with non-canonical internal data structure gives wrong results. The requirement for canonical data structures is not documented: https://sparse.pydata.org/en/stable/generated/sparse.GCXS.html
scipy.sparse.csr_matrix
doesn't require its data structure to be canonical. The GCXS constructor doesn't seem to check if a passed scipy.sparse.csr_matrix
is canonical. In effect, constructing a GCXS array from a perfectly valid scipy.sparse.csr_matrix
may create a broken GCXS array which gives wrong numerical results. Using the explicit GCXS.from_scipy_sparse()
gives the same behavior.
To Reproduce
This example uses cupyx.scipy.sparse.csr_matrix
since that produces non-canonical CSR data structures when constructed from dense arrays. This highlights the severity of this bug where normal use of the APIs of these packages gives numerically wrong results without causing any warning or error. In my understanding the bug is mostly in GCXS since scipy.sparse.csr_matrix
and cupyx.scipy.sparse.csr_matrix
work perfectly fine with non-canonical data structures.
import numpy as np
import cupy
import cupyx.scipy.sparse as csp
import sparse
# Just some data
source = np.random.random((23, 42))
# Reference slice
ref = source[2:4]
# We make a cupyx.scipy.sparse.csr_matrix from the source data
cupyx_sparse = csp.csr_matrix(cupy.array(source))
# It is not canonical, i.e. has duplicates or unsorted indices
print("Canonical?", cupyx_sparse.has_canonical_format)
# We slice it
r1 = cupy.asnumpy(cupyx_sparse[2:4].todense())
# Seems to work correctly
print("max difference cupyx.scipy.sparse", np.max(np.abs(ref - r1)))
# We make a GCXS matrix by first transferring from the GPU and then
# constructing GCXS
gcxs = sparse.GCXS(cupyx_sparse.get())
# We slice it
r2 = gcxs[2:4].todense()
# It is wrong
print("max difference GCXS", np.max(np.abs(ref - r2)))
Edit: Include output
Canonical? False
max difference cupyx.scipy.sparse 0.0
max difference GCXS 0.9852408955033491
Expected behavior
The expected behavior can be split into two aspects, of which one is preventing a bug with serious impact on downstream code, and the other a feature request.
First, constructing a GCXS array from a valid scipy.sparse.csr_matrix
that GCXS can't handle correctly should throw an error instead of producing a GCXS array that gives wrong results.
Second, it would be good if GCXS could be constructed successfully from a non-canonical scipy.sparse.csr_matrix
.
System
- OS and version: Win 11
-
sparse
version (sparse.__version__
'0.14.0') - NumPy version (
np.__version__
'1.24.4') - Numba version (
numba.__version__
'0.57.1') - CuPy version (
cupy.__version__
'12.1.0')
Additional context
At https://github.com/LiberTEM/sparseconverter/pull/14 I am currently working on tests and workarounds. Perhaps the test matrix in https://github.com/LiberTEM/sparseconverter/blob/main/tests/test_sparseconverters.py could be useful here?
Thanks for the detailed report! Since this is silent invalid results, I'd be inclined to fix it soon.
Hello, what exactly do you mean by canonical here? Can this be reproduced without cupyx
perhaps?
Hello, what exactly do you mean by canonical here? Can this be reproduced without
cupyx
perhaps?
It can be reproduced if the data within a row is not sorted by column index.
See https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.has_canonical_format.html#scipy.sparse.csr_matrix.has_canonical_format for the definition of "canonical"!
This code is a compact reproducer without cupy: https://github.com/LiberTEM/sparseconverter/blob/4cfc0ee2ad4c37b07742db8f3643bcbd858a4e85/src/sparseconverter/init.py#L154-L183