
Sparse data and abstract matrix input


Discussions at MLJ meetings have turned to the problem of sparse data. Data can be observation-sparse or feature-sparse (or both). My feeling is that the feature-sparse case is the more important use case, and the trickier to deal with. I originally thought one might handle this within the current tabular data format, but that requires extra infrastructure that does not yet exist in Julia. Given limited resources, the pragmatic thing to do would be to allow models that handle feature-sparse data to ingest the data as an abstract matrix (in addition to a dense table). When that matrix is sparse, the performance benefits kick in.

Having "given up" on the uniform requirement of tabular data, we might just as well allow arbitrary models that currently take tabular input to ingest data in the form of matrices as well. It would be quite natural to roll this out at the same time as implementing the new optional data front-end for models. If we do allow matrices, an important design decision concerns the output of models (say, of transformers, or the form of the target in multi-target supervised models). I guess if we train on matrices, then matrices should be the output, and similarly for tables.

Thoughts anyone?

@OkonSamuel @tlienart

ablaom avatar Jan 19 '21 02:01 ablaom

I seem to have outdated information for a big chunk of this discussion but fwiw I agree with you that "feature-sparse" seems the more relevant/important use case.

Note that in my experience, the story with the likes of pandas is not ideal either. Last I checked, if you use get_dummies (or some function of that name) to do one-hot encoding in pandas and pass the result to scikit-learn, it gets "densified" along the way (I may be incorrect).

One note: a big chunk of the use cases for sparse data is encoding (like OHE). Given that models in MLJ can ingest data and then do their own thing with it, you could imagine they do their own encoding and handle the sparsity as well, which may be what you're already suggesting?

tlienart avatar Jan 19 '21 08:01 tlienart

Allowing input to models to be an instance of AbstractMatrix is a good solution to the sparsity issue (just let the model implementers worry about it). My only concern is the inconsistency in the API. We could also add something like a SparseTable scitype to ScientificTypes.jl and define some method stubs that any instance of this type should support, then let developers worry about implementing these methods for their sparse table type.
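
To make the idea concrete, here is a purely illustrative sketch; SparseTable, sparsecolumns and fillvalue are hypothetical names that do not exist in ScientificTypes.jl:

abstract type SparseTable{K} end   # hypothetical scitype; K would be the scitype of the columns

# method stubs any concrete sparse table type would be expected to implement:
function sparsecolumns end         # hypothetical: return the stored (sparse) columns
function fillvalue end             # hypothetical: the implicit default entry, e.g. zero(Float64)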

OkonSamuel avatar Jan 19 '21 19:01 OkonSamuel

The absence of sparse data support in OneHotEncoder and ContinuousEncoder makes them unusable for data with a large number of features, or categorical features with a large number of categories. Yes, sparse features can be created by hand, but that defeats the purpose of MLJ, which is to make ML easy to do.

pgagarinov avatar Mar 31 '21 05:03 pgagarinov

I'm coming along to bump a relatively old conversation here -- how does this topic relate to the (now a while ago) discussed notion of supporting sparse matrices throughout the MLJ flow (i.e., avoiding densification by MLJ proper, at least, even if individual model implementers don't handle this properly)?

yalwan-sage avatar Aug 24 '21 09:08 yalwan-sage

@yalwan-sage Good to hear from you!

Let me clarify that MLJ itself does not impose densification. The issue is that MLJ encourages implementers of the MLJ model interface to accept tabular input where this makes sense. If densification is inevitable, this is no big deal. It would also not be a big deal if wrapping matrices with a large number of columns as tables worked well. As far as I know, a suitable sparse tabular format does not exist. I initially thought Tables.table might serve this purpose, but for very large numbers of columns there are issues (see this discussion).

As far as I can tell DataFrames deals with sparsity within columns, but not sparsity within rows.

Alternatively (or additionally) any model can choose to accept matrix data. In fact, in that case, it must be able to handle any AbstractMatrix. If that matrix happens to be sparse, then the fit algorithm can choose to avoid densification. There are a few models that accept matrix input, but I think only TSVDTransformer avoids densification in the internal algorithm:

julia> models() do m
       AbstractMatrix{Continuous} <: m.input_scitype
       end
 (name = EvoTreeClassifier, package_name = EvoTrees, ... )
 (name = EvoTreeCount, package_name = EvoTrees, ... )
 (name = EvoTreeGaussian, package_name = EvoTrees, ... )
 (name = EvoTreeRegressor, package_name = EvoTrees, ... )
 (name = TSVDTransformer, package_name = TSVD, ... )
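
For implementers, avoiding densification mostly means not calling Matrix(X) inside fit. Here is a minimal sketch; MySparseRidge and its ridge-style solve are made up for illustration and are not an existing MLJ model:

using LinearAlgebra, SparseArrays
import MLJModelInterface as MMI

mutable struct MySparseRidge <: MMI.Deterministic
    lambda::Float64
end

function MMI.fit(model::MySparseRidge, verbosity, X::AbstractMatrix, y)
    # if X is a SparseMatrixCSC, X'X stays sparse, so nothing below densifies:
    coefs = (X'X + model.lambda*I) \ (X'y)
    return coefs, nothing, NamedTuple()
end

MMI.predict(::MySparseRidge, coefs, Xnew) = Xnew * coefs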

Moving forward, either someone introduces a better feature-sparse tabular format (so that dealing with sparsity becomes more of an implementation detail), or existing models that can support sparse data extend their input_scitype declarations to include AbstractMatrix and dispatch on the input accordingly.
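
The trait extension in the second option is a one-liner. A sketch, with SomeModel standing in as a placeholder for an existing model type:

import MLJModelInterface as MMI

MMI.input_scitype(::Type{<:SomeModel}) =
    Union{MMI.Table(MMI.Continuous), AbstractMatrix{MMI.Continuous}}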

I'm not yet convinced by @OkonSamuel's suggestion that we need separate scitypes to handle sparse data. It seems to me that sparsity is more a property of the representation of the data than of its "scientific" interpretation. Perhaps flagging a model as supporting sparse data with a model trait is better.
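
For what such a trait might look like, a purely hypothetical sketch (no such trait currently exists in MLJModelInterface):

# hypothetical trait with a conservative fallback:
supports_sparse_data(::Type) = false
supports_sparse_data(::Type{<:MySparseRidge}) = true   # MySparseRidge as in the sketch above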

ablaom avatar Aug 24 '21 21:08 ablaom

As far as I can tell DataFrames deals with sparsity within columns, but not sparsity within rows.

Right, presumably because you can build your dataframe as a collection of SparseVectors
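
For example (a quick illustration using only DataFrames and SparseArrays, no MLJ):

using DataFrames, SparseArrays

df = DataFrame(a = sparsevec([1.0, 0.0, 0.0, 2.0]),
               b = spzeros(4))

[typeof(c) for c in eachcol(df)]   # both columns are still SparseVector{Float64, Int64}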

Perhaps flagging a model as supporting sparse data with a model trait is better.

So this is actually part of why I've come looking. I'm hoping to add support to LightGBM.jl for dataset construction from (to begin with) sparse matrices, and I was wondering if there was a scitype or trait I needed to set to indicate this when patching up the interface. From what I've understood of what you wrote, we don't yet have a finalised way for an implementer to indicate sparse support, and exactly how to do so is not settled. Is that right?

yalwan-sage avatar Aug 25 '21 07:08 yalwan-sage

Let me clarify that MLJ itself does not impose densification.

Thanks for that clarification

yalwan-sage avatar Aug 25 '21 08:08 yalwan-sage

So, currently, models do not need to articulate that they support sparsity.

However, a Zoom discussion with @OkonSamuel has raised another point for me, which is that it's probably worth models articulating (with a new trait) whether the core algorithm prefers observations as rows or as columns. Because of adjoint, there is no loss in generality in insisting on, say, the n x p convention for AbstractMatrix input (to match the requirement for tables). However, whether the user should provide an n x p Matrix/SparseMatrixCSV, or provide the adjoint of a p x n Array/SparseMatrix, depends on the model (the first is good for trees, the second for neural networks, for example). If we are resampling, that is also a consideration, as resampling observations using the first form is always worse than the second (the point @OkonSamuel reminded me of today). (I'm inclined to say that the requirements of the model would usually trump the desire for resampling efficiency.)
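
A small illustration of the two layouts, using nothing but SparseArrays:

using SparseArrays

n, p = 5, 3
X1 = sprand(n, p, 0.3)    # convention 1: n x p, observations as rows
Xt = sprand(p, n, 0.3)    # features as rows ...
X2 = Xt'                  # ... but presented as n x p via the adjoint

rows = [1, 3, 5]
X1[rows, :]               # resampling observations: row slicing of a CSC matrix walks every column
Xt[:, rows]               # the same observations under convention 2: cheap column slicing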

ablaom avatar Aug 25 '21 22:08 ablaom

Just to clarify, when you put SparseMatrixCSV you mean SparseMatrixCSC from SparseArrays, or is there something else I am unaware of?

yalwan-sage avatar Aug 26 '21 08:08 yalwan-sage

Yes, from the stdlib SparseArrays.

ablaom avatar Aug 27 '21 00:08 ablaom