A dense axis
Would it make sense to allow certain axes as dense parts in the otherwise sparse array?
I.e. this could/would allow faster algorithms along the given axes.
In particular when dealing with sparse data with many axes but a few recurring dense axes, such an abstraction would be really nice.
Would this be an option or is it too intrusive?
Hi! This is roughly the purpose of the BSR format in SciPy.
This is definitely in scope, although I don’t personally plan to do it. You’re free to give it a crack if you want, I’ll definitely help review.
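For reference, here is a small scipy.sparse.bsr_matrix example showing the blocked layout SciPy uses: block positions are indexed CSR-style, but each stored entry is a dense block.

```python
import numpy as np
from scipy.sparse import bsr_matrix

# A 6x6 matrix stored as 2x2 dense blocks; only two block positions are nonzero.
indptr = np.array([0, 1, 1, 2])                     # block-row pointers
indices = np.array([0, 2])                          # block-column indices
data = np.arange(8, dtype=float).reshape(2, 2, 2)   # one dense 2x2 block per entry

A = bsr_matrix((data, indices, indptr), shape=(6, 6))
print(A.blocksize)   # (2, 2)
print(A.toarray())
```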
I suppose you could make a format similar to CSR called CDR (compressed-dense-row) where you'd only have indptr and data arrays, and each nonzero row would be fully dense. I'm not aware of any usage of this scheme in the wild, and it would really only be efficient for cases where the sparsity pattern is almost always on a per-row basis.
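A rough numpy sketch of what such a hypothetical CDR format might look like; the class name and layout here are illustrative, not an existing API:

```python
import numpy as np

class CDR:
    """Hypothetical compressed-dense-row format: like CSR, but with no
    `indices` array -- every stored row is kept fully dense."""

    def __init__(self, indptr, data, shape):
        self.indptr = np.asarray(indptr)   # indptr[i+1] - indptr[i] is 0 or 1
        self.data = np.asarray(data)       # shape (n_stored_rows, ncols)
        self.shape = shape

    def toarray(self):
        out = np.zeros(self.shape, dtype=self.data.dtype)
        stored = np.diff(self.indptr) == 1     # True where the row is stored
        out[stored] = self.data
        return out

    def matvec(self, v):
        out = np.zeros(self.shape[0], dtype=self.data.dtype)
        out[np.diff(self.indptr) == 1] = self.data @ v
        return out

# Rows 0 and 2 of a 4x3 matrix are stored; rows 1 and 3 are implicitly zero.
m = CDR([0, 1, 1, 2, 2], [[1., 2., 3.], [4., 5., 6.]], shape=(4, 3))
```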
Ok, thanks for your insights.
The most generic way to do this, in my view, would be to have an arbitrary partition of "dense" and "sparse" axes. So the sparse array would effectively be a transpose and reshaping of a CSC sparse matrix.
Yes, one could definitely store the "inner dense dimensions" as an actual numpy array instead. And always reshuffle things appropriately when indexing or performing any other operations.
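As a sketch of that idea, the inner dense dimensions could live as trailing axes of the data array, with coords only covering the sparse axes. The names and layout here are illustrative, not sparse's actual API:

```python
import numpy as np

# A COO over the leading (sparse) axes whose `data` holds one dense
# sub-array per stored coordinate.
sparse_shape = (20, 20)     # sparse leading axes
dense_shape = (5,)          # dense trailing axis, e.g. time
coords = np.array([[0, 3], [7, 9]])       # (ndim_sparse, nnz): entries (0,7), (3,9)
data = np.random.randn(2, *dense_shape)   # (nnz, *dense_shape)

def todense(coords, data, sparse_shape, dense_shape):
    out = np.zeros(sparse_shape + dense_shape, dtype=data.dtype)
    out[tuple(coords)] = data             # one dense block per stored coordinate
    return out

full = todense(coords, data, sparse_shape, dense_shape)
```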
I thought about this some more and it seems the most generic thing to do is to have blocks rather than dense dimensions.
It took me a bit to figure out that "blocks are more generic than dense dimensions" meant that one could achieve the effect of dense dimensions by making the block size in that dimension equal to the array's size in that dimension.
There was a concern in another thread recently about the availability of algorithms for fancy new formats. Similar schemes have been around for several decades, but I don't know that anyone's been working on block COO matrices rather than block CSR matrices.
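To make the equivalence concrete, here is a minimal densification sketch for a hypothetical block-COO layout; choosing a block size equal to the full axis length makes that axis effectively dense:

```python
import numpy as np

def block_coo_todense(block_coords, block_data, shape):
    """Densify a hypothetical block-COO: block_coords index blocks,
    and block_data stores one dense block per coordinate."""
    block_shape = block_data.shape[1:]
    out = np.zeros(shape, dtype=block_data.dtype)
    for bc, block in zip(block_coords.T, block_data):
        sl = tuple(slice(c * b, (c + 1) * b) for c, b in zip(bc, block_shape))
        out[sl] = block
    return out

# Block size (2, 2) on a 4x4 array: an ordinary block-sparse layout.
bc = np.array([[0, 1], [1, 0]])                  # blocks at (0, 1) and (1, 0)
bd = np.arange(8, dtype=float).reshape(2, 2, 2)
a = block_coo_todense(bc, bd, (4, 4))

# Block size (4, 1): blocks span all of axis 0, so axis 0 is effectively dense.
bc2 = np.array([[0], [2]])                       # one block covering column 2
bd2 = np.ones((1, 4, 1))
b = block_coo_todense(bc2, bd2, (4, 4))
```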
Out of curiosity, is enough of sparse independent of scipy that a sparse.COO subclass with a properly-dimensioned data array would require only shape-checking modifications to .__init__() and a change to treating the individual elements as arrays in .__matmul__() and .transpose(), or would this require more work?
I could really use such a feature. In case you are interested in a concrete use for this type of arrays, I am working with (time inhomogeneous) Markov Chains. The 2 dimensions needed for the state space should be sparse, but the time dimension is always dense.
Out of curiosity, is enough of sparse independent of scipy that a sparse.COO subclass with a properly-dimensioned data array would require only shape-checking modifications to .__init__() and a change to treating the individual elements as arrays in .__matmul__() and .transpose(), or would this require more work?
We only use SciPy to convert to and from SciPy arrays, so it's almost completely independent. However, the issue here is that every algorithm would need to be modified to work with the new format... If you (or anyone else) makes a PR, we can get a "base set" of features working, with the rest to come. :slightly_smiling_face: I'm very likely to jump in and help.
For my objective of utilizing the slicing functionality of pydata/sparse, using a structured array as COO.data seems to be enough for me.
But it looks like we would need to implement arithmetic operations for COO with np.recarray data.
import numpy as np
from sparse import COO

rank = 3
n_points = 100
for denserank in [1, 2, 5]:
    coords = np.random.randint(0, 2, (rank, n_points))  # duplicated coords
    coords2 = coords + 1
    data = np.random.randn(n_points, denserank).view('f8, ' * denserank).squeeze()
    try:
        spa1 = COO(coords, data, shape=(20, 20, 20))
        spa2 = COO(coords2, data, shape=(20, 20, 20))
    except Exception as e:
        print("has_duplicates=True failed: denserank={} {}".format(denserank, e))
        spa1 = COO(coords, data, shape=(20, 20, 20), has_duplicates=False)
        spa2 = COO(coords2, data, shape=(20, 20, 20), has_duplicates=False)
    print("spa1[1:]", spa1[1:])
    try:
        spa1 + spa2
    except Exception as e:
        print("spa1 + spa2 failed: denserank={} {}".format(denserank, e))
I have a few questions regarding the potential implementation.
Format
@hameerabbasi can you elaborate on
I thought about this some more and it seems the most generic thing to do is to have blocks rather than dense dimensions.
Do you mean that every dimension would have a block size, which would be implemented as a separate array dimension in the data array? That seems more complicated than dense axes because slicing such block-ed sparse axis would potentially break the block structure.
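A small numeric illustration of that concern (the block size and slice bounds here are arbitrary): a slice whose endpoints don't fall on block boundaries intersects stored blocks only partially, so the result can't reuse them as-is and the data would have to be re-blocked:

```python
import numpy as np

blocksize = 2
# Stored block origins along one axis of length 8: blocks start at 0, 2, 4, 6.
origins = np.arange(0, 8, blocksize)

start, stop = 1, 5          # slice [1:5] begins and ends mid-block
covering = origins[(origins + blocksize > start) & (origins < stop)]
print(covering)             # blocks at 0, 2, and 4 all intersect the slice
aligned = (start % blocksize == 0) and (stop % blocksize == 0)
print(aligned)              # False: the stored blocks cannot be reused directly
```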
Class structure
What would be the best overall approach to implementing dense dimensions? It seems that dense dimensions/blocks belong in the base class because they would influence the shape calculations; is that correct?
Possible alternatives would be to modify one of the formats to support dense axes or to implement a mixin class.
Minimal functionality
What would be the minimal useful functionality set? It seems like per-axis sparse/dense conversion, slicing, tensordot, and elementwise ufunc would all be reasonable to consider.