sprs
sprs copied to clipboard
Creating CSR array with duplicates fails
Currently, CsMat::new
appears to fail on input that has duplicate values (i.e. when in a CSR matrix multiple indices for a row are the same). They are summed by default, if I read the following comment correctly,
By convention, duplicate locations are summed up when converting into CsMat
An example from https://github.com/vbarrielle/sprs/pull/135,
use sprs::CsMat;
let indptr = vec![0, 5, 8];
let indices = vec![761698, 328290, 828689, 761698, 780185, 901149, 780185, 144749, 258307];
let data = vec![1, 1, 1, 1, 1, 1, 1, 1, 1];
let a = CsMat::new((2, 1000000), indptr, indices, data);
this produces,
thread 'sparse::csmat::test::test_example' panicked at 'Indices length and inpdtr's nnz do not match: 9 != 8', src/sparse/csmat.rs:1095:13
using the code from that PR.
In this case, the initial len(indices)
is indeed 9 but it becomes 8 after the duplicates are summed.
FWIW scipy.sparse.csr_matrix
in Python (see docs and implementation ) handles this case as follows,
>>> from scipy.sparse import csr_matrix
>>> indptr = [0, 5, 8]
>>> indices = [761698, 328290, 828689, 761698, 780185, 901149, 780185, 144749, 258307]
>>> data = [1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> X = csr_matrix((data, indices, indptr))
>>> len(indices)
9
>>> len(X.indices)
8
>>> X.indptr
array([0, 4, 7], dtype=int32)
>>> X.indices
array([328290, 761698, 780185, 828689, 144749, 780185, 901149],
dtype=int32)
>>> X.data
array([1, 2, 1, 1, 1, 1, 1])
I can contribute this example as a test if necessary.
After more investigation and reading the docstring more carefully, maybe it's not a bug for CsMat::new
but is explicitly unsupported.
Currently, the result is indeed not consistent for this input, where nnz
(determined from the last element of indptr
) is 8 while indices.len()
is 9.
Maybe there is a difference in convention with scipy. Anyway, replacing indptr
by,
let indptr = vec![0, 5, 9];
in the above example, I now get,
panicked at 'called `Result::unwrap()` on an `Err` value: NonSortedIndices', libcore/result.rs:945:5
which seems to happen in here.
At the same time, if one has such data, the question is how one is expected to handle it. I could do it manually, but I would need to first sort the indices, and currently that would mean manually copy-pasting this code from sprs, as there is no stand alone API entry for it.
Hello and thanks for this report.
It was indeed a design choice that the basic constructor for CsMat
would not do too much magic. Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices. I'll look into that.
I'm however a bit curious about your input data. As I understand it, the last value in your indices
value is to be ignored (in scipy)? Why is it there?
It was indeed a design choice that the basic constructor for CsMat would not do too much magic
Sounds like a reasonable thing to do, and now I do wonder about the performance cost of always doing this duplicate summing in scipy, even when it's not necessary.
Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices.
Even a util function something like sum_duplicates(indices, indptr, values) -> (indices, values)
could be a start.
As I understand it, the last value in your indices value is to be ignored (in scipy)?
All values in indices
are used. There is some difference in the indptr
convention about the last element I think -- and I also wonder why.
Also, I would be happy to contribute this feature in general.
Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices.
Even a util function something like
sum_duplicates(indices, indptr, values) -> (indices, values)
could be a start.
Also, I would be happy to contribute this feature in general.
Yes I also think this is the way to go. If you want to contribute this feature, please do. May I suggest that
the util function takes its indices
and values
parameters as &mut
pointer? That way integrating this
function into a new CsMat
constructor would be smoother, and there's no loss in flexibility, one can always clone before passing to this function.
As I understand it, the last value in your indices value is to be ignored (in scipy)?
All values in
indices
are used. There is some difference in theindptr
convention about the last element I think -- and I also wonder why.
I really think the last value has been ignored by scipy. If you look at the sum of the data
array you pass, it is 9, but X.data.sum() == 8
. So it looks like scipy will only consider your indices
and data
arrays up to the nnz
value specified by the last element of indptr
.
If you look at the sum of the data array you pass, it is 9, but X.data.sum() == 8
yes, my indptr
was incorrect, it should have been indptr = [0, 5, 9]
which explains why the last element of indices
was ignored.
May I suggest that the util function takes its indices and values parameters as &mut pointer?
Sure, will do.