scikit-learn
scikit-learn copied to clipboard
RFC Supporting `scipy.sparse.sparray`
Context
SciPy is now favoring sparse arrays (i.e. scipy.sparse.sparray and its subclasses) over sparse matrices (i.e. scipy.sparse.spmatrix and its subclasses) to enlarge the scope of matrices to $n$-dimensional data-structures since SciPy 1.8 (see https://github.com/scipy/scipy/pull/14822).
Sparse matrices now subclass their associated sparse arrays (see https://github.com/scipy/scipy/pull/18440).
scikit-learn has been supporting sparse matrices but now also needs to support SciPy sparse arrays.
Proposed solutions
Ordered by preference:
- Use
scipy.sparse.issparseand theformatattribute everywhere and not useisinstanceat all - Use
isinstanceon private compatibility class defined to bescipy.sparse.spmatrixorscipy.sparse.sparrayconditionally on SciPy's version (i.e. usescipy.sparse.sparrayif available) - Rely on duck-typing or the class name to check for sparse arrays
cc @ivirshup
In general, I'm +1 with supporting scipy.sparse.sparray. Although, we likely need to support both spmatrix and sparray for a while. To support both, I suspect we need to be careful about some of the behavior differences between spmatrix and sparray.
https://github.com/scikit-learn/scikit-learn/pull/26420 has been opened as a small preliminary.
Also +1 for supporting scipy.sparse.sparray.
The real question for me is what to do with the old sparse matrices:
- Follow scipy. (What‘s their plan? Deprecation, removal, continued support?)
- Deprecate, maybe longer depr cycle than usual.
- Continued support
I‘m leaning towards 2 or 1.
SciPy is going to deprecate matrices and will ultimately remove them.
https://github.com/scikit-learn/scikit-learn/issues/27090 has been opened to update tests accordingly.
What should we do about things that return sparse classes?
After brief discussion with @jjerphan and @ogrisel it sounds like the desired behavior is to pass through the array classes as much as possible. E.g. we should try and have f(X: *_array, ...) -> *_array and f(X: *_matrix, ...) -> *_matrix where applicable.
I also think we should preserve the type of inputs.
I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead).
Use scipy.sparse.issparse and the format attribute everywhere and not use isinstance at all
I would be in favor of trying to do this as much as possible, but there might be some cases where the two have different behaviors and an explicit testing for the type might be necessary. I wonder what would be a concise way to do this to keep backward compat with scipy 1.8 though...
Maybe something like:
import scipy.sparse as sp
if sp.issparse(X):
if hasattr(sp, "isspmatrix") and not sp.isspmatrix(X):
# special handling for `scipy.sparse.sparray`
...
else:
# backward compat for `scipy.sparse.spmatrix`
...
I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead).
Once scipy starts issuing deprecation warnings, we might need a way to allow the users to silence them though.
Maybe an estimator with sparse_output=True could start to issue a FutureWarning to state that in two versions, scikit-learn will return scipy.sparse.sparray instances instead of scipy.sparse.spmatrix instances. To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.
Note that for ColumnTransformer, there is not sparse_output boolean parameter but instead it uses sparse_threshold=0.3. Introducing a new parameter sparse_output_container="sparray" is probably the only to respect our rolling deprecation strategy for backward compat.
Maybe something like:
There has been discussion of removing some of the isspmatrix_*/ isspmatrix functions and replacing them with instance checks. E.g.: isinstance(x, sp.spmatrix). However, there will be some releases of scipy where issubclass(sp.csr_array, spmatrix) is True... I think this could be worth a helper function which may need to take into account the version of scipy.
- https://github.com/scipy/scipy/pull/18565
- https://github.com/scipy/scipy/issues/18921
To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.
:+1: this is the approach I've been taking with a number of deprecations. It's a little annoying due to the number of warnings and downstream authors have to change code twice, but I haven't been able to come up with something as safe with less action required.
One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?
One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?
Ideally, if those implement the Array API (which is not the case at the moment), this might be possible for estimators that accept Array API inputs. For estimators that work with Cython code and directly access the underlying CSR/CSC memory buffers via pointers then there is no solution unless we provide an arguably less efficient, pure Array API code path for such inputs.
I think this could be worth a helper function which may need to take into account the version of scipy.
Let's do this in the first PR where we actually need it.