
Merge various xarray scikit-learn wrappers

nbren12 opened this issue 6 years ago · 15 comments

Hi Peter,

cc @darothen

I just wanted to get the ball rolling on combining the functionality of our packages. I think the main thing I could contribute to a merged package would be an xarray-aware version of scikit-learn's Union objects. These are for performing different transformation pipelines on different variables in a Dataset and then concatenating the output. The goal I had with my package was to provide functionality similar to https://github.com/scikit-learn-contrib/sklearn-pandas.

I know that Daniel Rothenberg (cc'd) was also thinking about releasing a package with similar functionality, so I am sure he will have something to add.

nbren12 avatar Jan 17 '18 22:01 nbren12

Hi Noah, I've looked into it and I think this would be a very valuable contribution; however, we'd have to approach it at a higher level.

Here's an example that demonstrates what the problem currently is:

from sklearn_xarray import wrap
from sklearn_xarray.data import load_digits_dataarray
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, FactorAnalysis

X = load_digits_dataarray()

pca = wrap(PCA, reshapes='feature')
fa = wrap(FactorAnalysis, reshapes='feature')
fu = wrap(FeatureUnion([('fa', fa), ('pca', pca)]), reshapes='feature')

Xt = fu.fit_transform(X)

In []: Xt
Out[]:
<xarray.DataArray (sample: 1797, feature: 128)>
array([[ -2.774548e-02,   1.651637e+00,  -7.823751e-01, ...,  -2.724035e-30,
         -5.115270e-31,  -6.913927e-16],
       [  5.531233e-01,  -1.620620e+00,   3.651561e-01, ...,  -3.480428e-16,
          3.754779e-16,   1.670464e-16],
       [  5.087785e-01,  -7.738681e-01,   2.485600e-01, ...,   2.722518e-15,
         -2.406039e-16,   1.166543e-16],
       ...,
       [  7.851090e-01,  -5.591102e-01,   4.747128e-01, ...,   1.692116e-15,
          8.208560e-17,   9.708098e-17],
       [ -2.993117e-01,   9.862983e-01,  -8.658195e-01, ...,   1.890185e-15,
          1.426731e-16,  -1.237629e-16],
       [ -3.705396e-02,   4.995286e-01,   8.938544e-01, ...,  -2.736311e-16,
          7.196079e-16,   2.706710e-16]])
Coordinates:
  * sample   (sample) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
    digit    (sample) int32 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 ...
Dimensions without coordinates: feature

So it works in theory, but because of the reshapes='feature' argument we lose all coordinates along the feature dimension.

In order to keep coordinates for estimators that change the number of features we'd have to think about some mechanism that maps coordinates when calling transform and predict. My idea is related to #7, i.e. explicitly implementing a wrapped estimator for each of the estimators in sklearn. Each wrapped estimator would have to implement a method like get_reshaped_coords which returns the reshaped coordinates. Additionally, we would need _CommonEstimatorWrapper._update_coords to use these reshaped coordinates.

So in the case of FeatureUnion we would have something like:

import sklearn.pipeline

class FeatureUnion(_CommonEstimatorWrapper, _ImplementsTransformMixin):

    def __init__(self, **fit_params):
        super(FeatureUnion, self).__init__(sklearn.pipeline.FeatureUnion, **fit_params)

    def get_reshaped_coords(self, coords, dim):
        # TODO: stack coords along dim
        return coords

If you want to implement this, I would be very happy. The code in _update_coords is a bit hard to read, but maybe I can take care of that part.

phausamann avatar Jan 18 '18 11:01 phausamann

So it works in theory, but because of the reshapes='feature' argument we lose all coordinates along the feature dimension.

It should be pretty easy to pass along a pd.MultiIndex object which keeps track of the concatenated dimension.
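For illustration, here's a rough sketch of that idea (the sizes and names are made up, not taken from the example above): the concatenated feature axis gets a (transformer, component) MultiIndex so we know which columns came from which sub-estimator.

import numpy as np
import pandas as pd
import xarray as xr

# hypothetical sizes of the two sub-transformer outputs ('fa' and 'pca')
n_fa, n_pca = 64, 64

# (transformer, component) MultiIndex for the concatenated feature axis
feature_index = pd.MultiIndex.from_arrays(
    [np.repeat(['fa', 'pca'], [n_fa, n_pca]),
     np.r_[np.arange(n_fa), np.arange(n_pca)]],
    names=['transformer', 'component'])

# stand-in for fu.fit_transform(X); the MultiIndex keeps track of which
# columns belong to which sub-estimator
Xt = xr.DataArray(np.zeros((1797, n_fa + n_pca)),
                  dims=['sample', 'feature'],
                  coords={'feature': feature_index})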

My idea is related to #7, i.e. explicitly implementing a wrapped estimator for each of the estimators in sklearn. Each wrapped estimator would have to implement a method like get_reshaped_coords which returns the reshaped coordinates.

Before I dive into your code too much, I wanted to bring up a couple of points. First, there are two approaches to wrapping an sklearn pipeline. The first is to wrap each individual transformer with xarray methods, which is what you seem to be doing with your package. The second is to wrap the whole pipeline object and let most of the internal transformations occur in numpy-land, which is the approach I take. The advantage of this is that I don't have to keep up with the sklearn API and can just provide some xarray-aware transformers that help me do simple preprocessing and reshaping. I then have a Stacker object which takes a multidimensional xarray and spits out a 2D numpy array.
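Roughly, a Stacker like that could look like this (a simplified sketch, not the actual implementation; the sample_dim default is an assumption):

from sklearn.base import BaseEstimator, TransformerMixin


class Stacker(BaseEstimator, TransformerMixin):
    """Collapse every non-sample dimension of a DataArray into a single
    'feature' axis and return a plain 2D numpy array."""

    def __init__(self, sample_dim='sample'):
        self.sample_dim = sample_dim

    def fit(self, X, y=None):
        # remember which dims get folded into the feature axis
        self.feature_dims_ = [d for d in X.dims if d != self.sample_dim]
        return self

    def transform(self, X):
        stacked = X.stack(feature=tuple(self.feature_dims_))
        return stacked.transpose(self.sample_dim, 'feature').values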

Second, since FeatureUnion joins multiple pipelines, it is a little more complex. One solution is to make an XarrayFeatureUnion object with methods that have the following signatures:

transform(X: dict of DataArrays) -> np.ndarray
inverse_transform(X: np.ndarray) -> dict of DataArrays

Of course, the output of this could also be a 2D DataArray which keeps track of the feature coordinate in a MultiIndex, but I am not sure I see the point.

nbren12 avatar Jan 18 '18 16:01 nbren12

Hi @nbren12 and @phausamann ,

Thanks for jump-starting this discussion. My approach is an odd combination of both of yours; I implemented it for a project here. The basic idea is to explicitly emulate the sklearn Transformer/Union API, but to provide utility tools that work with xarray objects (mostly Datasets, although I think all of them work fine with DataArrays). The end result is that you can write traditional scikit-learn Pipelines but operate directly on xarray objects:

_model = Pipeline([
    ('subset_latlon', DatasetSelector(
        sel='isel', lon=gcf.ilon, lat=gcf.ilat)),
    ('predictors', FieldExtractor(self.predictors)),
    ('normalize', Normalizer()),
    ('dataset_to_array', DatasetAdapter(drop=['lat', 'lon'])),
    ('linear', LinearRegression()),
])

I wrote things this way because I often wanted to do analyses using neighborhoods of cells, and this framework let me construct the feature sets that I needed. I parallelize by essentially applying each Pipeline to every grid cell in a Dataset, which is trivial to do using joblib. I'm sure there are better approaches!
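Roughly, the joblib part looks like this (a simplified sketch; make_model, target, and the lat/lon dimension names are placeholders, not code from the project):

from joblib import Parallel, delayed

def fit_cell(ds, i, j):
    # select one column of the Dataset and fit a fresh pipeline to it
    cell = ds.isel(lat=i, lon=j)
    model = make_model()  # builds a Pipeline like the one above
    return model.fit(cell, cell[target].values)

fitted = Parallel(n_jobs=-1)(
    delayed(fit_cell)(ds, i, j)
    for i in range(ds.sizes['lat'])
    for j in range(ds.sizes['lon']))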

darothen avatar Jan 18 '18 17:01 darothen

First, there are two approaches to wrapping an sklearn pipeline. The first is to wrap each individual transformer with xarray methods, which is what it seems you are trying to do with your package.

That's a good point. I chose that approach for a couple of reasons. First, if you want to use individual estimators with xarray, you'll have to wrap them anyway. Second, that way you can just use the standard pipeline from sklearn without any additional overhead, and it works with both Datasets and DataArrays. But most importantly, I wanted to introduce a way of having preprocessing transformers that change the number of samples in the pipeline. For this to work, I use xarray coordinates as targets for supervised tasks. This means, however, that I have to update the object pointing at the coordinates in every step of the pipeline, which the wrapper takes care of.

Of course, the output of this could also be a 2D DataArray which keeps track of the feature coordinate in a MultiIndex, but I am not sure I see the point.

I think this would actually be the better approach in order to make sure we don't break compatibility with whatever comes after the FeatureUnion in case the FeatureUnion itself is part of a pipeline.

phausamann avatar Jan 18 '18 18:01 phausamann

@darothen At first glance, that approach seems very similar to what I have been working on. Instead of FieldExtractor, DatasetAdapter, and DatasetSelector, I have Select and Stacker classes that work directly with sklearn.Pipeline. Here is an example of how they are used. The main difference from Peter's approach is that at some point in the pipeline the data is transformed into a plain numpy array. However, we can always use Peter's wrap function with standard pipeline objects. The only piece that is missing is a Union object that has an inverse_transform method; for some reason, sklearn's does not have this.

First, if you want to use individual estimators with xarray, you'll have to wrap them anyway.

I agree that it makes sense to wrap some of the more commonly used estimators.

But most importantly, I wanted to introduce a way of having preprocessing transformers that change the number of samples in the pipeline. For this to work, I use xarray coordinates as targets for supervised tasks.

I am not sure I understand. Do you have an example of what you mean? I usually think in terms of sample_dims, over which all the points are considered samples. For example, if I have a 3D field f(x, y, z), I often consider the individual horizontal locations (x, y) as samples of the data along the z-axis, if that makes sense. For things like cross-validation, I just manually split the data.
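In code, that notion of sample_dims amounts to a stack/transpose, e.g. (using a made-up DataArray):

import numpy as np
import xarray as xr

# hypothetical 3-D field f(x, y, z)
da = xr.DataArray(np.random.rand(10, 20, 30), dims=['x', 'y', 'z'])

# treat each horizontal location (x, y) as one sample, with z as the feature axis
samples = da.stack(sample=('x', 'y')).transpose('sample', 'z')
X = samples.values  # shape (10 * 20, 30), ready for scikit-learn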

As a whole, I think we can easily accommodate all of our approaches.

nbren12 avatar Jan 18 '18 18:01 nbren12

Here's an example from the docs. Basically, the Sanitizer removes samples from the dataset, which would not work in a normal pipeline because X and y would end up with an inconsistent number of samples. But rather than passing an explicit y to fit, I use a Target object that points to one of the coordinates and gets updated at each step.
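A condensed sketch of that mechanism (along the lines of the docs example but not copied from it; X is the digits DataArray from the first example, and the estimator choice is arbitrary):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn_xarray import Target, wrap
from sklearn_xarray.preprocessing import Sanitizer

# Sanitizer drops samples containing NaNs; the Target pointing at the
# 'digit' coordinate is kept in sync with X at every step, so y never
# goes out of alignment even though the number of samples changes.
pipeline = Pipeline([
    ('sanitize', Sanitizer()),
    ('model', wrap(LogisticRegression(), reshapes='feature')),
])

y = Target(coord='digit')
pipeline.fit(X, y)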

phausamann avatar Jan 18 '18 19:01 phausamann

Ok. I think I may understand now. I have always assumed the number of samples is constant between the input and output, but I can see how making the whole pipeline xarray-aware allows a better way to keep track of or ignore missing values. I probably haven't come across this because I work with atmospheric model output, which is very clean.

Do you think there could also be a home in your package for the more low-level approach that Daniel and I use?

nbren12 avatar Jan 18 '18 19:01 nbren12

Yeah, totally! There is, btw, also the possibility of using wrapped estimators in a pipeline with plain numpy arrays; the wrapped estimator determines its input type at fit time... so if you have estimators that return numpy arrays, that shouldn't be a problem. I'd just like to avoid introducing too many special cases or breaking compatibility when it's not absolutely necessary.
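For example, something along these lines should work (a sketch illustrating the duck-typing just described, not code from the package):

import numpy as np
from sklearn.decomposition import PCA
from sklearn_xarray import wrap

# the wrapper detects at fit time that the input is a plain numpy array
# and then simply behaves like the underlying sklearn estimator
X_np = np.random.rand(100, 20)
est = wrap(PCA(n_components=5))
Xt = est.fit(X_np).transform(X_np)  # Xt is a plain numpy array again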

Anyway, looking at @darothen's project, there are a couple of transformers that could go into the preprocessing module. Maybe we could also add modules that bundle more domain-specific functionality.

And regarding FeatureUnion... I still feel like a general solution that keeps feature coordinates intact would be the most promising approach, also regarding things like FeatureSelection where we have a similar problem.

phausamann avatar Jan 18 '18 20:01 phausamann

Cool! FWIW, I have a couple of transformers in here. At first glance, it seems that @darothen and I differ on how we handle Datasets vs. DataArrays. I have been working on pipelines where I have a different preprocessing step for each variable. I am mostly following this example from the scikit-learn docs.

So all my pipelines look like this:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline, make_union

# Select and Stacker are custom transformers from my package
mod = make_pipeline(
    make_union(
        make_pipeline(Select('a'), Stacker()),
        make_pipeline(Select('b'), Stacker())),
    LinearRegression())

Here, Select simply selects a key from a dict-like object. For some cases, Datasets don't work because the variables may be defined over different subsets of a given dimension, but variables within a Dataset must share coordinates. In these cases, I need to use a dict of DataArrays rather than a Dataset.
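For completeness, Select is essentially just this (a simplified sketch, not the exact code):

from sklearn.base import BaseEstimator, TransformerMixin


class Select(BaseEstimator, TransformerMixin):
    """Pull a single variable out of a Dataset or dict of DataArrays."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]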

we could also add modules that bundle more domain-specific functionalities

That's a good idea. For example, I have some transformers that weight the data along some set of dimensions.

And regarding FeatureUnion... I still feel like a general solution that keeps feature coordinates intact would be the most promising approach

Sure. For my part, I think the output of FeatureUnion should be a DataArray where the feature coordinate is a pd.MultiIndex. That would make it pretty easy to implement both transform and inverse_transform methods.
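For example, with a ('variable', 'component') MultiIndex on the feature dimension, the inverse direction is a single selection per variable (hypothetical helper, names assumed):

def split_union(Xt):
    # Xt: DataArray with a ('variable', 'component') MultiIndex on 'feature'
    variables = Xt.indexes['feature'].get_level_values('variable').unique()
    return {name: Xt.sel(variable=name) for name in variables}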

nbren12 avatar Jan 18 '18 21:01 nbren12

Thanks for the explanation, I see your point now. I think it would be very useful to have a mechanism to parallelize part of the pipeline on a per-variable basis. But maybe we should implement a dedicated estimator for that (probably in the pipeline module). Something like:

Parallelizer({'a': Stacker(), 'b': Stacker()})

that maps each variable to a dedicated estimator (or pipeline) and takes care of joining the Dataset back together afterwards (we'd need a better name, though...).

Regarding your issue with Datasets, that sounds like a valid feature request for xarray itself. But it should be very easy to support dicts of DataArrays, because under the hood the wrapper for Datasets loops over the data_vars dict anyway, creating a dict with a fitted estimator for each variable.
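Conceptually something like this (a heavily simplified sketch of that per-variable loop, assuming estimators that accept DataArrays, e.g. wrapped ones):

from sklearn.base import clone

def fit_per_variable(estimator, X, y=None):
    # one independently fitted estimator per data variable in the Dataset
    return {name: clone(estimator).fit(da, y)
            for name, da in X.data_vars.items()}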

phausamann avatar Jan 19 '18 07:01 phausamann

we'd need a better name though...

How about DataArrayUnion? I do think the output of DataArrayUnion.transform should be a DataArray with a MultiIndex rather than a Dataset, for the reasons I mentioned above. That way it can accommodate something like combining the first 5 PCs of variable A and the first 3 PCs of variable B.

I think the Stacker transformers could be removed if we assume all the pipelines produce DataArrays with the same sample dims. So the syntax could look like:

DataArrayUnion(
    [('First10PCsOfA', make_pipeline(Select('a'), PCA(10))),
     ('First5PCsOfB', make_pipeline(Select('b'), PCA(5)))],
    sample_dims=['x', 'y', 'time']
)

If you think that looks ok, I can start to work on it.

nbren12 avatar Jan 19 '18 17:01 nbren12

I feel like it should be possible to combine the two ideas into a single estimator that both (a) applies a transformer or pipeline to each variable in the dataset and (b) allows the user to specify the return type. Maybe something like this (I wouldn't call it DataArrayUnion if it doesn't always return a DataArray):

FeatureJunction(
    {'a': PCA(10), 'b': PCA(5)}, 
    return_type='DataArray'
)

This way we don't need the Select estimator; we just specify the transformer for each variable with the dict as the first parameter. It also integrates nicely into the existing internal API, because fitted Dataset wrappers have an estimator_dict_ attribute anyway.

It would then also be possible to say return_type='Dataset' which should raise an error in this case because the feature dimensions don't match. In that case return_type='dict' would be a workaround as described above.

btw, I should probably make a wiki article describing the internal API to get you started.

phausamann avatar Jan 22 '18 08:01 phausamann

I definitely think the dictionary-input idea is good, but I think it is better to provide it as a function, sort of like how sklearn has make_union and make_pipeline. The problem with dictionary input is that it won't allow combining multiple pipelines of the same variable (e.g. the first 10 PCs of the stratosphere with the first 10 PCs of the troposphere), because dictionaries cannot have two entries for one key. We could instead make the input a list of tuples, like sklearn-pandas does.

Also, I would hesitate to specify the return type of a class using an argument. The internal logic is completely different between these cases, so I think it would be better to have two classes DataArrayUnion and DatasetUnion.

nbren12 avatar Jan 22 '18 18:01 nbren12

I definitely think the dictionary input idea is good, but I think it is better to provide it as a function sort of like how sklearn has make_union and make_pipeline.

afaik these are just simplified functional interfaces to the Pipeline and FeatureUnion constructors. So we'd just provide a make_xyz function for whatever this estimator will be called in the end, which does the same thing.

The problem with dictionary input, is that it won't allow for combining multiple pipelines of the same variable (e.g. first 10 pcs of stratosphere, with first 10 pcs of troposphere) because dictionaries cannot have two entries for one key. We could instead make the input a list of tuples like sklearn-pandas does.

Good point, let's do it this way.

Also, I would hesitate to specify the return type of a class using an argument. The internal logic is completely different between these cases, so I think it would be better to have two classes DataArrayUnion and DatasetUnion.

I see what you mean, but I thought in both cases we'd have a wrapper that applies one pipeline per variable, the difference being that in the DataArray case we'd concatenate the result along some dimension. Or am I wrong?

phausamann avatar Jan 22 '18 18:01 phausamann

Anyway, these things are just technicalities, I think you can start working on a PR and we'll continue the discussion there. I've put some information together in the wiki.

phausamann avatar Jan 23 '18 06:01 phausamann