sprs icon indicating copy to clipboard operation
sprs copied to clipboard

Creating CSR array with duplicates fails

Open rth opened this issue 5 years ago • 6 comments

Currently, CsMat::new appears to fail on input that has duplicate values (i.e. when in a CSR matrix multiple indices for a row are the same). They are summed by default, if I read the following comment correctly,

By convention, duplicate locations are summed up when converting into CsMat

An example from https://github.com/vbarrielle/sprs/pull/135,

use sprs::CsMat;

let indptr = vec![0, 5, 8];
let indices = vec![761698, 328290, 828689, 761698, 780185, 901149, 780185, 144749, 258307];
let data = vec![1, 1, 1, 1, 1, 1, 1, 1, 1];
let a = CsMat::new((2, 1000000), indptr, indices, data);

this produces,

thread 'sparse::csmat::test::test_example' panicked at 'Indices length and inpdtr's nnz do not match: 9 != 8', src/sparse/csmat.rs:1095:13

using the code from that PR.

In this case, the initial len(indices) is indeed 9 but it becomes 8 after the duplicates are summed.

FWIW scipy.sparse.csr_matrix in Python (see docs and implementation ) handles this case as follows,

>>> from scipy.sparse import csr_matrix
>>> indptr = [0, 5, 8]
>>> indices = [761698, 328290, 828689, 761698, 780185, 901149, 780185, 144749, 258307]
>>> data = [1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> X = csr_matrix((data, indices, indptr))
>>> len(indices)
9
>>> len(X.indices)
8
>>> X.indptr
array([0, 4, 7], dtype=int32)
>>> X.indices
array([328290, 761698, 780185, 828689, 144749, 780185, 901149],
      dtype=int32)
>>> X.data
array([1, 2, 1, 1, 1, 1, 1])

I can contribute this example as a test if necessary.

rth avatar Nov 11 '18 12:11 rth

After more investigation and reading the docstring more carefully, maybe it's not a bug for CsMat::new but is explicitly unsupported.

Currently, the result is indeed not consistent for this input, where nnz (determined from the last element of indptr) is 8 while indices.len() is 9.

Maybe there is a difference in convention with scipy. Anyway, replacing indptr by,

let indptr = vec![0, 5, 9];

in the above example, I now get,

panicked at 'called `Result::unwrap()` on an `Err` value: NonSortedIndices', libcore/result.rs:945:5

which seems to happen in here.

At the same time, if one has such data, the question is how one is expected to handle it. I could do it manually, but I would need to first sort the indices, and currently that would mean manually copy-pasting this code from sprs, as there is no stand alone API entry for it.

rth avatar Nov 11 '18 19:11 rth

Hello and thanks for this report.

It was indeed a design choice that the basic constructor for CsMat would not do too much magic. Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices. I'll look into that.

I'm however a bit curious about your input data. As I understand it, the last value in your indices value is to be ignored (in scipy)? Why is it there?

vbarrielle avatar Nov 21 '18 16:11 vbarrielle

It was indeed a design choice that the basic constructor for CsMat would not do too much magic

Sounds like a reasonable thing to do, and now I do wonder about the performance cost of always doing this duplicate summing in scipy, even when it's not necessary.

Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices.

Even a util function something like sum_duplicates(indices, indptr, values) -> (indices, values) could be a start.

As I understand it, the last value in your indices value is to be ignored (in scipy)?

All values in indices are used. There is some difference in the indptr convention about the last element I think -- and I also wonder why.

rth avatar Nov 23 '18 07:11 rth

Also, I would be happy to contribute this feature in general.

rth avatar Nov 23 '18 08:11 rth

Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices.

Even a util function something like sum_duplicates(indices, indptr, values) -> (indices, values) could be a start.

Also, I would be happy to contribute this feature in general.

Yes I also think this is the way to go. If you want to contribute this feature, please do. May I suggest that the util function takes its indices and values parameters as &mut pointer? That way integrating this function into a new CsMat constructor would be smoother, and there's no loss in flexibility, one can always clone before passing to this function.

As I understand it, the last value in your indices value is to be ignored (in scipy)?

All values in indices are used. There is some difference in the indptr convention about the last element I think -- and I also wonder why.

I really think the last value has been ignored by scipy. If you look at the sum of the data array you pass, it is 9, but X.data.sum() == 8. So it looks like scipy will only consider your indices and data arrays up to the nnz value specified by the last element of indptr.

vbarrielle avatar Nov 23 '18 08:11 vbarrielle

If you look at the sum of the data array you pass, it is 9, but X.data.sum() == 8

yes, my indptr was incorrect, it should have been indptr = [0, 5, 9] which explains why the last element of indices was ignored.

May I suggest that the util function takes its indices and values parameters as &mut pointer?

Sure, will do.

rth avatar Nov 23 '18 08:11 rth