
Multi-dimensional LogisticRegression

Open aaronspring opened this issue 3 years ago • 7 comments

First of all: great software combining sklearn and xarray @phausamann

My example here is inspired from #52, an example which helped me a lot.

Input data: a forecast with dimensions T (sample), X (lon), Y (lat). Ground truth: observations with the same dimensions. Goal: can LR predict the correct tercile? Or can LR correct the tercile?

I want to do the same as looping over all X and Y on the grid and applying LR to every grid cell individually, but using sklearn-xarray so the computation is vectorized (without the loop). Is that possible?

Question: Do I take X and Y as features, or do I have just one feature (the variable)? But then how do I vectorize the calculation over all X and Y?

# on develop branch
x = mask(x).rename({'T':'sample'})
x.coords # 3 categories -1,0,1 for terciles
Coordinates:
  * sample        (sample) datetime64[ns] 1999-06-23 1999-06-30 ... 2015-09-16
  * X             (X) float32 -20.0 -19.0 -18.0 -17.0 -16.0 ... 7.0 8.0 9.0 10.0
  * Y             (Y) float32 1.0 2.0 3.0 4.0 5.0 ... 16.0 17.0 18.0 19.0 20.0
    category_obs  (Y, X, sample) float64 nan nan nan nan ... 1.0 1.0 0.0 -1.0

y = Target(coord="category_obs", transform_func=LabelEncoder().fit_transform, dim='sample')
pl = Pipeline(
    [
        ('feat', Featurizer()),
        ("transposer", Transposer(order=('sample', "feature"))),
        ('sanitizer', Sanitizer(dim='feature')),
        ("elr", wrap(LogisticRegression(fit_intercept=False), reshapes="feature")),
    ]
)
pl.predict_proba(x).unstack().sizes
Frozen({'sample': 226, 'feature': 3}) # but I want to get X, Y, T, probability dim
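For contrast, the explicit per-cell loop described above might look like the following plain numpy/sklearn sketch. This is a toy reconstruction, not code from the issue: the array sizes, the tercile construction, and the variable names are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_t, n_y, n_x = 50, 3, 4                       # samples, lat, lon (toy sizes)
fcst = rng.normal(size=(n_t, n_y, n_x))        # fake forecast values

# Per-cell tercile labels -1/0/1 derived from the forecast itself,
# so every grid cell sees all three classes.
q = np.quantile(fcst, [1 / 3, 2 / 3], axis=0)  # shape (2, n_y, n_x)
obs = (fcst > q[0]).astype(int) + (fcst > q[1]).astype(int) - 1

# One independent LogisticRegression per grid cell, trained on n_t samples.
proba = np.empty((n_t, n_y, n_x, 3))
for j in range(n_y):
    for i in range(n_x):
        clf = LogisticRegression(fit_intercept=False)
        clf.fit(fcst[:, j, i, None], obs[:, j, i])  # one feature per cell
        proba[:, j, i, :] = clf.predict_proba(fcst[:, j, i, None])

print(proba.shape)  # (50, 3, 4, 3): T, Y, X, class probability
```

This loop is exactly what the question hopes to vectorize away.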

gist: https://gist.github.com/aaronspring/745a40c3ebedb9ea59cd3862c5c22724

I think I shouldn't take X, Y as features, but I also don't want to take X, Y, T as samples, because every grid cell should be trained on its own. I got that case running in https://renkulab.io/gitlab/aaron.spring/s2s-ai-competition-bootstrap/-/blob/master/notebooks/ELR_sklearn-xarray.ipynb, but I want the model to be trained on every grid cell separately. I don't see how to keep the X, Y dimensions on the side, given sklearn's requirement of 2D arrays.

Helpful comments appreciated.

aaronspring avatar Apr 23 '21 11:04 aaronspring

Hi @aaronspring, if I understand correctly you want to train an independent classifier for each (X, Y) coordinate? That is, each grid cell is trained on x.sizes["T"] samples?

I don't think sklearn's LogisticRegression supports that; it will only produce a single prediction for each sample, so if you stack your X and Y dimensions as the feature dimension you will only get one prediction over all grid cells.

If you really want a completely independent model for each grid cell you won't get around training each classifier in a loop. You can take advantage of the fact that sklearn-xarray trains independent estimators for each variable in a dataset by creating independent data variables for each grid cell with x.stack(idx=["X", "Y"]).to_dataset("idx"). There's an open issue (#30) to use joblib for parallelizing the fitting for datasets.
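A minimal sketch of that stacking trick with toy data (the array sizes and coordinate values are made up for illustration):

```python
import numpy as np
import xarray as xr

# Toy DataArray with the dimension names from the thread.
x = xr.DataArray(
    np.random.rand(10, 2, 3),
    dims=("sample", "Y", "X"),
    coords={"sample": range(10), "Y": [1.0, 2.0], "X": [-1.0, 0.0, 1.0]},
)

# One data variable per (X, Y) grid cell; sklearn-xarray can then fit
# an independent estimator for each variable.
ds = x.stack(idx=["X", "Y"]).to_dataset("idx")
print(len(ds.data_vars))  # 6 variables, one per grid cell
```

Each resulting variable is keyed by its (X, Y) tuple and carries only the sample dimension.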

On the other hand, if you are okay with other grid cells being taken into account for the prediction of each cell (which I think makes sense), I would suggest using a very simple MLPClassifier with activation="logistic".
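A toy sketch of that suggestion, treating all grid cells together as the features of a single MLPClassifier (the shapes, hidden layer size, and labels are assumptions, not part of the issue):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_samples, n_cells = 100, 12              # e.g. a 3x4 grid flattened to features
X = rng.normal(size=(n_samples, n_cells))  # all grid cells as one feature vector
y = rng.integers(-1, 2, size=n_samples)    # tercile labels -1/0/1

# Logistic activation, as suggested; one model sees all cells at once.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X).shape)  # (100, 3)
```

Unlike the per-cell loop, this single model lets neighboring cells inform each prediction.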

phausamann avatar Apr 28 '21 09:04 phausamann

In any case, this kind of feedback is super valuable because it shows which workflows should be demonstrated in the documentation. I think it would be worth adding an example similar to this one to the docs.

phausamann avatar Apr 29 '21 11:04 phausamann

Thanks for your reply @phausamann

if I understand correctly you want to train an independent classifier for each (X, Y) coordinate? That is, each grid cell is trained on x.sizes["T"] samples?

Exactly, that's what I want to do. In the end I want to do this on a 1.5 degree grid with 240x120 grid cells, and temperature and precipitation from different parts of the globe shouldn't train each other.

independent estimators for each variable in a dataset by creating independent data variables for each grid cell with x.stack(idx=["X", "Y"]).to_dataset("idx")

I try this workaround:

# stack dims X, Y to idx as dataset, T to sample
X = x.stack(idx=['X','Y']).T.rename({'T':'sample'})

# sanitize
X = X.sel(idx=~X['category_obs'].isnull().all('sample'))  # drop cells that are all-NaN
X = X.expand_dims('feature', axis=-1)  # my original variable is the one feature
X.shape  # (620, 226, 1) gridpoints, samples, features

X = X.to_dataset('idx')
X.coords # category_obs still depends on T
Coordinates:
  * sample         (sample) datetime64[ns] 1999-06-23 1999-06-30 ... 2015-09-16
    category_fcst  (idx, sample) float64 nan nan nan nan nan ... nan nan nan nan
    category_obs   (idx, sample) float64 nan nan nan nan nan ... nan nan nan nan
# has 217 data_vars

Y = Target(coord="category_obs", transform_func=LabelEncoder().fit_transform)(X)
ValueError: y should be a 1d array, got an array of shape (620, 226) instead.

I tried using dim='sample' or dim='idx', but then I got all 0s in Y.

aaronspring avatar May 01 '21 15:05 aaronspring

I also googled sklearn examples a lot but didn't find anything like what I am looking for. Probably this is not what people do.

The workflow of sklearn on a grid is embarrassingly parallel when doing the same type of computation on each grid cell individually.
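Since each cell is independent, the per-cell fits can be fanned out across cores with joblib. This is a toy sketch under assumed shapes and names, not the code from the gists:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_t, n_cells = 60, 8                      # samples, flattened grid cells
fcst = rng.normal(size=(n_t, n_cells))

# Tercile labels -1/0/1 per cell, so each fit sees all three classes.
q = np.quantile(fcst, [1 / 3, 2 / 3], axis=0)
obs = (fcst > q[0]).astype(int) + (fcst > q[1]).astype(int) - 1

def fit_cell(i):
    # Fit one independent classifier for grid cell i.
    clf = LogisticRegression(fit_intercept=False)
    clf.fit(fcst[:, i, None], obs[:, i])
    return clf.predict_proba(fcst[:, i, None])

# Embarrassingly parallel: each cell is a separate job.
proba = Parallel(n_jobs=2)(delayed(fit_cell)(i) for i in range(n_cells))
print(np.stack(proba, axis=1).shape)  # (60, 8, 3): sample, cell, class
```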

aaronspring avatar May 04 '21 21:05 aaronspring

I now do a loop over stacked lon and lat: https://gist.github.com/aaronspring/e9f4edba833664ef179113ed25ccea51

aaronspring avatar May 05 '21 17:05 aaronspring

I posted this issue and the larger question in the pangeo discourse: https://discourse.pangeo.io/t/vectorized-sklearn/1444

aaronspring avatar May 06 '21 09:05 aaronspring

During the https://s2s-ai-challenge.github.io/ challenge, a solution using xr.apply_ufunc appeared: https://gist.github.com/aaronspring/36e112e992e36fba935f73404dbbd3cd
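A rough sketch of what such an apply_ufunc solution can look like: a fit-and-predict function over the sample dimension, vectorized over Y and X. This is an assumption-laden reconstruction with toy data, not the gist's code.

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LogisticRegression

def fit_predict_proba(fcst_1d, obs_1d):
    # Fit one classifier on this cell's time series and return
    # its class probabilities, shape (sample, category).
    clf = LogisticRegression(fit_intercept=False)
    clf.fit(fcst_1d[:, None], obs_1d)
    return clf.predict_proba(fcst_1d[:, None])

rng = np.random.default_rng(0)
fcst = xr.DataArray(rng.normal(size=(60, 2, 3)), dims=("sample", "Y", "X"))

# Tercile labels -1/0/1 per cell, derived from the forecast itself.
q = fcst.quantile([1 / 3, 2 / 3], dim="sample")
obs = ((fcst > q.isel(quantile=0)).astype(int)
       + (fcst > q.isel(quantile=1)).astype(int) - 1)

# vectorize=True loops the 1-D function over every (Y, X) cell.
proba = xr.apply_ufunc(
    fit_predict_proba, fcst, obs,
    input_core_dims=[["sample"], ["sample"]],
    output_core_dims=[["sample", "category"]],
    vectorize=True,
)
print(dict(proba.sizes))  # {'Y': 2, 'X': 3, 'sample': 60, 'category': 3}
```

The X, Y, sample, and category dimensions all survive, which is exactly the output shape the original question asked for.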

aaronspring avatar Nov 17 '21 22:11 aaronspring