flow_stability
flow_stability copied to clipboard
Index sorting in the `csr_` operations
Comparing the output of the
csr_methods with native implementations we get mismatches on.dataand.indices(see e.g. https://github.com/alexbovet/flow_stability/actions/runs/10865864561/job/30152834847#step:6:102).A smaller example illustrates this better (also comparing
csr_addto a native addition):# data [0.16911086 0.26525225 0.04051438 0.09085281 0.55254878 0.82202173 0.51055564 0.50439578 0.85289212 0.74480397] # native [0.16911086 0.26525225 0.04051438 0.09085281 0.55254878 0.82202173 0.51055564 0.85289212 0.50439578 0.74480397] # csr_add # indicies [0 1 1 2 3 3 4 5 7 9] # native [0 1 1 2 3 3 4 7 5 9] # csr_add # indptr [ 0 0 1 2 2 4 5 5 7 10 10] # native [ 0 0 1 2 2 4 5 5 7 10 10] # csr_addBasically, the indices are not ordered within a
indptrslice butdatamatch theindices(at least in this example). Not sure if that leads to any trouble down the road.
It is expected that the indices may not be sorted after some operations. csr matrices have a has_sorted_indices attribute.
If you do a sort_indices() and eliminate_zeros() before the np.testing.assert_allclose(), it should hopefully fix the problem.
Originally posted by @alexbovet in https://github.com/alexbovet/flow_stability/issues/35#issuecomment-2353056512
Regarding the tests, there is now a little helper function that allows to compare matrices with not necessarily sorted indices:
https://github.com/alexbovet/flow_stability/blob/b658f27a67921b5f69d192ebbe8c058c8cfea73d/tests/conftest.py#L75-L88
@alexbovet
Regarding the csr_ operations, would it make sense to perform an in-place sort_indices() before returning the csr_matrix?
@alexbovet Regarding the
csr_operations, would it make sense to perform an in-place sort_indices() before returning thecsr_matrix?
Is it necessary?
@alexbovet Regarding the
csr_operations, would it make sense to perform an in-place sort_indices() before returning thecsr_matrix?Is it necessary?
Unsure. It could matter, depending on how operations are implemented. If it is looped over the indices and some </> are used, then we might skip some indices.
This will, in general, depend on the implementation, I guess.
Having just a quick look at our own csr_ operations we use something a bit along these lines in the cython_csr_csrT_matmul function:
https://github.com/alexbovet/flow_stability/blob/8648a8bd715e3b086ed13cb7453d41754c176398/src/flowstab/_cython_sparse_stoch.pyx#L331-L334
@alexbovet OK in our implementations it can matter!: If we take the example data from the comment:
import numpy as np
from scipy.sparse import csr_matrix
from flowstab.SparseStochMat import csr_csrT_matmul
n_data = np.array([0.16911086, 0.26525225, 0.04051438, 0.09085281, 0.55254878, 0.82202173,
0.51055564, 0.50439578, 0.85289212, 0.74480397]) # native
data = np.array([0.16911086, 0.26525225, 0.04051438, 0.09085281, 0.55254878, 0.82202173,
0.51055564, 0.85289212, 0.50439578, 0.74480397]) # csr_add
# indicies
n_indices = np.array([0, 1, 1, 2, 3, 3, 4, 5, 7, 9]) # native
indices = np.array([0, 1, 1, 2, 3, 3, 4, 7, 5, 9]) # csr_add
# indptr
n_indptr = np.array([ 0, 0, 1, 2, 2, 4, 5, 5, 7, 10, 10]) # native
indptr = np.array([ 0, 0, 1, 2, 2, 4, 5, 5, 7, 10, 10]) # csr_add
An = csr_matrix((n_data, n_indices, n_indptr))
A = csr_matrix((data, indices, indptr))
np.testing.assert_equal(An.toarray(), A.toarray()) # make sure it's the same array repr.
so we have A as unsorted output and An as sorted, and use csr_csrT_matmul:
In [22]: np.testing.assert_equal(csr_csrT_matmul(A, A).toarray(), csr_csrT_matmul(An, An).toarray())
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[22], line 1
----> 1 np.testing.assert_equal(csr_csrT_matmul(A, A).toarray(), csr_csrT_matmul(An, An).toarray())
...
AssertionError:
Arrays are not equal
Mismatched elements: 1 / 100 (1%)
Max absolute difference among violations: 0.2544151
Max relative difference among violations: 0.16557306
ACTUAL: array([[0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0.028598, 0. , 0. , 0. , 0. ,...
DESIRED: array([[0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0.028598, 0. , 0. , 0. , 0. ,...
If we sort the indices of A things are OK again:
In [23]: A.sort_indices()
In [23]: np.testing.assert_equal(csr_csrT_matmul(A, A).toarray(), csr_csrT_matmul(An, An).toarray())
In [24]:
Unless I'm doing something illogical here, it seems to me that we should sort the csr matrices before returning them.
Checking if the input matrices have sorted indices could be a good thing.
Okay, I see. It matters for the Cython code I wrote. But I guess the scipy functions take into account that indices may not be sorted. In the end, I don't think I used my Cython implementation of the matrices operations for stoch_mat. I always converted to scipy csr to multiply or add them as it was faster. But anyway, it would be a good idea to force a sorting of the indices before calling any cython functions.