scikit-learn icon indicating copy to clipboard operation
scikit-learn copied to clipboard

RFC Supporting `scipy.sparse.sparray`

Open jjerphan opened this issue 2 years ago • 2 comments

Context

SciPy is now favoring sparse arrays (i.e. scipy.sparse.sparray and its subclasses) over sparse matrices (i.e. scipy.sparse.spmatrix and its subclasses) to enlarge the scope of matrices to $n$-dimensional data-structures since SciPy 1.8 (see https://github.com/scipy/scipy/pull/14822).

Sparse matrices now subclass their associated sparse arrays (see https://github.com/scipy/scipy/pull/18440).

scikit-learn has been supporting sparse matrices but now also needs to support SciPy sparse arrays.

Proposed solutions

Ordered by preference:

  • Use scipy.sparse.issparse and the format attribute everywhere and not use isinstance at all
  • Use isinstance on private compatibility class defined to be scipy.sparse.spmatrix or scipy.sparse.sparray conditionally on SciPy's version (i.e. use scipy.sparse.sparray if available)
  • Rely on duck-typing or the class name to check for sparse arrays

cc @ivirshup

jjerphan avatar May 22 '23 23:05 jjerphan

In general, I'm +1 with supporting scipy.sparse.sparray. Although, we likely need to support both spmatrix and sparray for a while. To support both, I suspect we need to be careful about some of the behavior differences between spmatrix and sparray.

thomasjpfan avatar May 23 '23 00:05 thomasjpfan

https://github.com/scikit-learn/scikit-learn/pull/26420 has been opened as a small preliminary.

jjerphan avatar May 23 '23 23:05 jjerphan

Also +1 for supporting scipy.sparse.sparray. The real question for me is what to do with the old sparse matrices:

  1. Follow scipy. (What‘s their plan? Deprecation, removal, continued support?)
  2. Deprecate, maybe longer depr cycle than usual.
  3. Continued support

I‘m leaning towards 2 or 1.

lorentzenchr avatar Jul 08 '23 17:07 lorentzenchr

SciPy is going to deprecate matrices and will ultimately remove them.

jjerphan avatar Jul 08 '23 20:07 jjerphan

https://github.com/scikit-learn/scikit-learn/issues/27090 has been opened to update tests accordingly.

jjerphan avatar Aug 18 '23 10:08 jjerphan

What should we do about things that return sparse classes?

After brief discussion with @jjerphan and @ogrisel it sounds like the desired behavior is to pass through the array classes as much as possible. E.g. we should try and have f(X: *_array, ...) -> *_array and f(X: *_matrix, ...) -> *_matrix where applicable.

ivirshup avatar Aug 18 '23 12:08 ivirshup

I also think we should preserve the type of inputs.

I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead).

jjerphan avatar Aug 18 '23 12:08 jjerphan

Use scipy.sparse.issparse and the format attribute everywhere and not use isinstance at all

I would be in favor of trying to do this as much as possible, but there might be some cases where the two have different behaviors and an explicit testing for the type might be necessary. I wonder what would be a concise way to do this to keep backward compat with scipy 1.8 though...

Maybe something like:

import scipy.sparse as sp


if sp.issparse(X):
    if hasattr(sp, "isspmatrix") and not sp.isspmatrix(X):
        # special handling for `scipy.sparse.sparray`
        ...
    else:
        # backward compat for `scipy.sparse.spmatrix`
        ...

ogrisel avatar Aug 22 '23 08:08 ogrisel

I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead).

Once scipy starts issuing deprecation warnings, we might need a way to allow the users to silence them though.

Maybe an estimator with sparse_output=True could start to issue a FutureWarning to state that in two versions, scikit-learn will return scipy.sparse.sparray instances instead of scipy.sparse.spmatrix instances. To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.

Note that for ColumnTransformer, there is not sparse_output boolean parameter but instead it uses sparse_threshold=0.3. Introducing a new parameter sparse_output_container="sparray" is probably the only to respect our rolling deprecation strategy for backward compat.

ogrisel avatar Aug 22 '23 08:08 ogrisel

Maybe something like:

There has been discussion of removing some of the isspmatrix_*/ isspmatrix functions and replacing them with instance checks. E.g.: isinstance(x, sp.spmatrix). However, there will be some releases of scipy where issubclass(sp.csr_array, spmatrix) is True... I think this could be worth a helper function which may need to take into account the version of scipy.

  • https://github.com/scipy/scipy/pull/18565
  • https://github.com/scipy/scipy/issues/18921

To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.

:+1: this is the approach I've been taking with a number of deprecations. It's a little annoying due to the number of warnings and downstream authors have to change code twice, but I haven't been able to come up with something as safe with less action required.

One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?

ivirshup avatar Aug 22 '23 09:08 ivirshup

One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?

Ideally, if those implement the Array API (which is not the case at the moment), this might be possible for estimators that accept Array API inputs. For estimators that work with Cython code and directly access the underlying CSR/CSC memory buffers via pointers then there is no solution unless we provide an arguably less efficient, pure Array API code path for such inputs.

I think this could be worth a helper function which may need to take into account the version of scipy.

Let's do this in the first PR where we actually need it.

ogrisel avatar Aug 24 '23 08:08 ogrisel