# CSC/CSR classes
**Is your feature request related to a problem? Please describe.**
Could explicit `CSR` and `CSC` classes be added to this library?

I would love to stop using `scipy.sparse.spmatrix` classes. Their matrix API is a source of constant frustration. The lack of CSC and CSR arrays with an `np.ndarray` API has been a barrier to adoption not only of this library, but of `xarray` and `dask` for the `scanpy`/`anndata` projects.
**Describe the solution you'd like**
I would like there to be explicit `CSR` and `CSC` classes in this package. These should perform at least as well as their `scipy.sparse.spmatrix` counterparts.
While higher-dimensional algorithms are developed, performant implementations for these cases can be taken from `scipy.sparse` or developed from reference implementations. I think this is a low-effort, high-payoff path toward performant implementations for common formats. They wouldn't be n-dimensional, but there's a lot of value in being able to start consolidating efforts in this library before an n-dimensional generalization has the features and performance of the common 2-D cases.
**Describe alternatives you've considered**
**No explicit CSC/CSR classes.** These are subsets of more generic structures, so why special-case them? I think this is the wrong approach due to how common these structures are. Even a flexible framework like taco exports these formats (`pytaco.csc` and `pytaco.csr`).
We could also consider what code looks like if these classes are not defined. All dispatch here could change from `isinstance(x, sparse.csr_matrix)` to something like `isinstance(x, sparse.GCXS) and x.compressed_axes == (0,) and x.ndim == 2`, but something closer to the first case is probably an easier sell.
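To make the comparison concrete, here is a minimal, hypothetical sketch of the two dispatch styles. `GCXS` and `CSR` below are toy stand-ins defined inline, not the real pydata/sparse classes:

```python
# Toy stand-ins for illustration only; NOT the real pydata/sparse classes.

class GCXS:
    def __init__(self, compressed_axes=(0,), ndim=2):
        self.compressed_axes = compressed_axes
        self.ndim = ndim

class CSR(GCXS):
    """2-D row-compressed special case."""
    def __init__(self):
        super().__init__(compressed_axes=(0,), ndim=2)

def is_csr_via_class(x):
    # Style 1: one isinstance check, like isinstance(x, sparse.csr_matrix).
    return isinstance(x, CSR)

def is_csr_via_attrs(x):
    # Style 2: check the generic class plus its layout attributes.
    return isinstance(x, GCXS) and x.compressed_axes == (0,) and x.ndim == 2

x = CSR()
print(is_csr_via_class(x), is_csr_via_attrs(x))  # True True
```

Both predicates accept the same arrays; the point is only how much the caller has to spell out at each dispatch site.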
**CSC/CSR classes as thin wrappers of an n-dimensional format.**
I think this is a great goal. However, for broad adoption it requires an implementation of that format which is at least as performant in most (probably all) cases as the `scipy` equivalents. As mentioned above, performant implementations can be (1) sourced from a current dependency (`scipy.sparse`) or (2) easily adapted from existing reference implementations.
Code specialized for these cases could gradually be replaced with more generic versions as they are implemented. However, performant implementations for these common cases would not be blocked by the implementation of n-dimensional algorithms.
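As a rough illustration of the "source it from scipy" path, here is a hypothetical thin CSR wrapper (the class name and API are invented for this sketch) that delegates its 2-D matmul kernel to `scipy.sparse`:

```python
import numpy as np
import scipy.sparse as sps

class CSRSketch:
    """Hypothetical thin CSR class (name and API invented here) that
    borrows scipy's battle-tested 2-D kernels until generic
    n-dimensional implementations catch up."""

    def __init__(self, mat):
        self._mat = sps.csr_matrix(mat)

    @classmethod
    def from_dense(cls, arr):
        return cls(sps.csr_matrix(np.asarray(arr)))

    def __matmul__(self, other):
        # Specialized 2-D path: delegate to scipy's fast CSR matmul,
        # returning dense output for simplicity in this sketch.
        return (self._mat @ other._mat).toarray()

a = CSRSketch.from_dense(np.eye(3))
b = CSRSketch.from_dense(np.arange(9.0).reshape(3, 3))
print(a @ b)
```

A real implementation would of course return a sparse result and cover the full operator surface; the sketch only shows how little glue the delegation itself needs.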
**Additional context**
**I could expand on why I think this is the right way to go.** I could go on about how important I think it is for there to be CSR and CSC classes in this library, and why it's important to implement these separately from higher-dimensional generalizations, but I find that makes this post too long. I'd be happy to add these arguments if asked.
**Type hierarchy**
I don't know what the appropriate type hierarchy is here. Should `issubclass(CSR, GCXS)` work? Or should there be an abstract `CompressedDimSparseArray` parent class for all of these? And what is the result type of `np.stack` applied to a `list[CSR]`?
These are things that could be decided on later, once the n-dimensional compressed representation is public API.
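One possible shape of the hierarchy, mirroring the open questions above (these are illustrative stand-ins, not the real pydata/sparse classes):

```python
from abc import ABC

# Illustrative stand-ins only; NOT the real pydata/sparse classes.

class CompressedDimSparseArray(ABC):
    """Abstract parent for any format that compresses some dimensions."""

class GCXS(CompressedDimSparseArray):
    """Generic n-dimensional compressed format."""

class CSR(GCXS):
    """2-D special case with compressed_axes == (0,)."""

print(issubclass(CSR, GCXS))                      # True
print(issubclass(CSR, CompressedDimSparseArray))  # True
```

Under this arrangement both `issubclass` questions answer "yes"; whether that is the right design is exactly what would need deciding.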
Hello, `GCXS` already exists in this library, but the performance isn't great; we're working on improving it while keeping the same API (so it should be safe to use).

It'd be easy to inherit from that for `CSR`/`CSC` classes. @daletovar would you be willing to work on this next? @ivirshup Would you be interested in contributing?
> We could also consider what code looks like if these classes are not defined. All dispatch here could change from `isinstance(x, sparse.csr_matrix)` to something like `isinstance(x, sparse.GCXS) and x.compressed_axes == (0,) and x.ndim == 2`, but something closer to the first case is probably an easier sell.
This is what I had in mind when I added GCXS initially. I do understand wanting to have explicit CSR/CSC classes. I think that regardless of whether there are explicit classes or not, there needs to be some specialized code for CSR/CSC. Right now `np.dot` for GCXS arrays runs at a similar speed to scipy's. We do a sort so as to return a canonical CSR/CSC; scipy does not, and I think this is the main discrepancy between them. However, indexing and element-wise ops are quite a bit slower. I think these are areas where there's a lot to be gained from special-casing for two dimensions.
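The canonical-form discrepancy is visible directly in scipy, where sorted column indices are opt-in rather than guaranteed:

```python
import numpy as np
import scipy.sparse as sps

# Row 0 stores column indices (2, 0): deliberately out of order.
data = np.array([1.0, 2.0])
indices = np.array([2, 0])
indptr = np.array([0, 2])
m = sps.csr_matrix((data, indices, indptr), shape=(1, 3))

print(bool(m.has_sorted_indices))  # False: scipy accepts non-canonical CSR
m.sort_indices()                   # canonicalization is opt-in
print(bool(m.has_sorted_indices), m.indices)
```

So code that always returns canonical CSR/CSC is doing strictly more work per operation than scipy's equivalents.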
This is something I can look into adding. Unless you would prefer to work on this, @ivirshup? No pressure if you're not interested.
Very happy to see interest in this!
I am interested in contributing, but also have a fairly busy schedule at the moment, and can't implement it alone. The array API also seems quite large, so I'd imagine it's helpful to have multiple contributors on this.
It would probably also be good to have significant input from the maintainers here on class and code structure (e.g. what the superclasses are, and how `__array_function__`s and `ufunc`s are organized/dispatched).
@daletovar I have done some proof-of-concept work at ivirshup/sparse_wrapper, including some numba implementations of indexing ops. In particular:
- Element-wise ops (equivalent to broadcasting, right?), which can be faster/more generic than scipy here: sparse_wrapper/_core/coordinate_ops.py
- I reuse indexing from scipy, but have been working on numba alternatives here: sparse_wrapper/_core/indexing.py. Initial results seem promising, though they haven't been tested much.
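For flavor, here is a toy sketch of what a coordinate-based element-wise op looks like: a sorted two-pointer merge over coordinate/value arrays. This is an illustrative simplification, not the code in sparse_wrapper:

```python
import numpy as np

def coo_elementwise_add(coords_a, vals_a, coords_b, vals_b):
    """Toy element-wise add of two sparse 1-D vectors given as sorted
    coordinate/value sequences. Illustrative only; NOT the
    sparse_wrapper implementation."""
    out_c, out_v = [], []
    i = j = 0
    while i < len(coords_a) and j < len(coords_b):
        if coords_a[i] == coords_b[j]:      # coordinate in both: add values
            out_c.append(coords_a[i])
            out_v.append(vals_a[i] + vals_b[j])
            i += 1
            j += 1
        elif coords_a[i] < coords_b[j]:     # coordinate only in a
            out_c.append(coords_a[i])
            out_v.append(vals_a[i])
            i += 1
        else:                               # coordinate only in b
            out_c.append(coords_b[j])
            out_v.append(vals_b[j])
            j += 1
    out_c += list(coords_a[i:]) + list(coords_b[j:])  # leftover tails
    out_v += list(vals_a[i:]) + list(vals_b[j:])
    return np.array(out_c), np.array(out_v)

coords, vals = coo_elementwise_add([0, 3], [1.0, 2.0], [3, 5], [10.0, 20.0])
print(coords, vals)  # coords [0 3 5], vals [1. 12. 20.]
```

The merge is linear in the number of stored elements, which is the kind of structure numba compiles well.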
I'd note that much of the array-level behaviour is currently handled by simply wrapping a `scipy.sparse.spmatrix`.
What would be a good first step here?
I imagine getting the classes to a point where individual features can be developed in separate PRs would be important, so having a stable class structure and the basic parameters down would be valuable.
Right, so this is how I envision it going:

- Inherit from `GCXS`, passing on the "right" `compressed_axes`.
- Implement conversion functions (`asformat`).
- Modify `SparseArray.__array_ufunc__` to produce `CSR`/`CSC`.
These would be the first steps, I imagine.
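The three steps above could be sketched with toy stand-ins (a dense backing array, and none of these are the real pydata/sparse classes) as:

```python
import numpy as np

# Toy stand-ins sketching the three steps; NOT the real pydata/sparse classes.

class GCXS:
    def __init__(self, dense, compressed_axes=(0,)):
        self.dense = np.asarray(dense)          # stand-in storage
        self.compressed_axes = compressed_axes

    def asformat(self, fmt):
        # Step 2: conversion functions between formats.
        if fmt == "csr":
            return CSR(self.dense)
        raise NotImplementedError(fmt)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Step 3: ufuncs applied to a CSR/CSC instance return that subclass.
        arrs = [x.dense if isinstance(x, GCXS) else x for x in inputs]
        return type(self)(getattr(ufunc, method)(*arrs, **kwargs))

class CSR(GCXS):
    def __init__(self, dense):
        # Step 1: inherit from GCXS, pinning the "right" compressed_axes.
        super().__init__(dense, compressed_axes=(0,))

x = GCXS([[1.0, 0.0], [0.0, 2.0]]).asformat("csr")
y = np.multiply(x, x)                        # dispatches via __array_ufunc__
print(type(y).__name__, y.compressed_axes)   # CSR (0,)
```

The `type(self)(...)` trick in `__array_ufunc__` is what keeps ufunc results in the subclass rather than falling back to the generic format.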
With #442 merged, should this issue be closed, or is there work remaining? (Update: I found https://github.com/pydata/sparse/issues/443, I'll leave a link here in this comment in case others are as dense as me.)