sklearn-xarray icon indicating copy to clipboard operation
sklearn-xarray copied to clipboard

Examples for higher dimension data

Open mmann1123 opened this issue 3 years ago • 9 comments

I think that sklearn-x could be a significant contribution to folks working on satellite remote sensing. I am REALLY excited by it!

I am however have a hard time figure out how to apply it in higher dim xarray structures. I am hoping that more examples using ideally geospatial data could be provided.

Possible format for example A typical land cover classification where 'target' must be predicted as f(blue, green, red, nir) across time on a pixel by pixel basis.

rgbn_20160101 = image from satellite stored as array for jan 1 2016 ['blue', 'green', 'red','nir'] = each image has four bands the three visible bands, and near infrared
['t1', 't2','t3'] = three images are provided across time for the same coordinates ['y,'x'] = latitude and longitude of each pixel

from geowombat.data import rgbn_20160101, rgbn_20160401, rgbn_20160517

import geowombat as gw
with gw.open([rgbn_20160101, rgbn_20160401, rgbn_20160517],
                 band_names=['blue', 'green', 'red','nir'],
                 time_names=['t1', 't2','t3']) as src:
        print(src.load())
         
        target = (10 * np.random.rand(3, 403, 515)).astype('int')
        new_band = src.sel(band='blue')
        new_band.values = target
        new_band['band'].values  = 'target'
        out = xr.concat([src,  new_band] , dim="band")
        
        print(out)
<xarray.DataArray (time: 3, band: 5, y: 403, x: 515)>
array([[[[ 61,  81, 134, ...,  83,  86,  90],
         [ 64,  78, 113, ...,  91,  91,  83],
         [ 74,  80,  95, ...,  81,  82,  77],
         ...,
         ...,
         [  3,   0,   4, ...,   8,   9,   8],
         [  9,   8,   2, ...,   3,   3,   4],
         [  9,   0,   5, ...,   2,   1,   0]]]])
Coordinates:
  * y        (y) float64 2.05e+06 2.05e+06 2.05e+06 ... 2.048e+06 2.048e+06
  * time     (time) <U2 't1' 't2' 't3'
  * x        (x) float64 7.93e+05 7.93e+05 7.93e+05 ... 7.956e+05 7.956e+05
  * band     (band) <U6 'blue' 'green' 'red' 'nir' 'target'
Attributes:
      ...

I would normally flatten these array using a multi-index as follows and store them as a pd.dataframe:

Pixel Time Target R G ....
1 T1 Forest 50 230 ....
1 T2 Agriculture 180 30 ....
2 T1 Water 43 55 ....
2 T2 Water 50 66 ....

Then fit a model Target = f(red,green,blue, nir) on a subset of pixels where I have "target" land use clases, and predict back to all red green blue nir bands in the full data set.

It looks as if sk-xr can handle this, but it isn't clear to me how to structure the xarray to create a panel of longitudinal data (pixel values over time).

mmann1123 avatar Aug 07 '20 17:08 mmann1123

Another possible xarray structure, that may be easier:

dataset = xr.merge([ src.to_dataset(dim='band'), new_band.to_dataset(name='Target')])
print(dataset)
Dimensions:  (time: 3, x: 515, y: 403)
Coordinates:
  * x        (x) float64 7.93e+05 7.93e+05 7.93e+05 ... 7.956e+05 7.956e+05
  * y        (y) float64 2.05e+06 2.05e+06 2.05e+06 ... 2.048e+06 2.048e+06
  * time     (time) <U2 't1' 't2' 't3'
    band     <U6 'target'
Data variables:
    blue     (time, y, x) uint8 61 81 134 132 88 104 ... 197 181 143 156 156 137
    green    (time, y, x) uint8 44 83 144 138 74 102 ... 213 194 150 166 165 148
    red      (time, y, x) uint8 44 74 139 143 74 98 ... 216 195 145 165 172 154
    nir      (time, y, x) uint8 24 90 129 107 66 81 ... 178 158 148 144 130 111
    Target   (time, y, x) int64 9 3 3 5 7 9 1 9 5 8 4 ... 6 3 8 1 7 2 0 4 3 1 7

mmann1123 avatar Aug 07 '20 18:08 mmann1123

Hi @mmann1123, thanks for bringing this up.

The documentation is definitely missing examples for geospatial data as I myself use xarray for very different kinds of data, so this is something I'm guessing most of the user base would appreciate.

In your example, after loading the DataArray like this:

import numpy as np
import geowombat as gw
from geowombat.data import rgbn_20160101, rgbn_20160401, rgbn_20160517

with gw.open(
    [rgbn_20160101, rgbn_20160401, rgbn_20160517],
    band_names=["blue", "green", "red", "nir"],
    time_names=["t1", "t2", "t3"],
) as src:
    src.load()

the recommended way is to create a non-dimensional coordinate with the target values. Here's a very simple mock target where pixels with green channel values above 128 are assumed to be forest, the rest water:

land_use = np.tile(
    "water", (src.sizes["time"], src.sizes["y"], src.sizes["x"])
).astype(object)
land_use[src.sel(band="green").values > 128] = "forest"
land_use = land_use.astype(str)
src.coords["land_use"] = (["time", "y", "x"], land_use)

Now, for compatibility with sklearn, you need to stack all sample dimensions to get a 2D array:

X = src.stack(sample=("x", "y", "time")).T
X
<xarray.DataArray (sample: 622635, band: 4)>
array([[ 61,  44,  44,  24],
       [ 61,  44,  44,  24],
       [ 61,  44,  44,  24],
       ...,
       [137, 148, 154, 111],
       [137, 148, 154, 111],
       [137, 148, 154, 111]], dtype=uint8)
Coordinates:
  * band      (band) <U5 'blue' 'green' 'red' 'nir'
    land_use  (sample) <U6 'water' 'water' 'water' ... 'forest' 'forest'
  * sample    (sample) MultiIndex
  - x         (sample) float64 7.93e+05 7.93e+05 ... 7.956e+05 7.956e+05
  - y         (sample) float64 2.05e+06 2.05e+06 ... 2.048e+06 2.048e+06
  - time      (sample) object 't1' 't2' 't3' 't1' 't2' ... 't3' 't1' 't2' 't3'
Attributes:
  ...

The land_use coordinate of this array can now be used as a target:

from sklearn_xarray import Target
from sklearn.preprocessing import LabelBinarizer

y = Target(coord="land_use", transform_func=LabelBinarizer().fit_transform)(X)

and finally you can fit an estimator:

from sklearn_xarray import wrap
from sklearn.linear_model import LogisticRegression

wrapper = wrap(LogisticRegression(), reshapes="band")
wrapper.fit(X, y)

You can then use the fitted estimator for prediction. Unfortunately, the output of the prediction are the labels from the LabelBinarizer, but we can easily reverse this:

yp = wrapper.predict(X)
yp.values = LabelBinarizer().fit(X.land_use).classes_[yp]

Finally, all you have to do is unstack the sample dimension again:

yp = yp.unstack("sample")
yp
<xarray.DataArray (x: 515, y: 403, time: 3)>
array([[['water', 'water', 'water'],
        ['water', 'water', 'water'],
        ['water', 'water', 'water'],
        ...,
        ['forest', 'forest', 'forest'],
        ['forest', 'forest', 'forest'],
        ['forest', 'forest', 'forest']]], dtype='<U6')
Coordinates:
    land_use  (x, y, time) <U6 'water' 'water' 'water' ... 'forest' 'forest'
  * x         (x) float64 7.93e+05 7.93e+05 7.93e+05 ... 7.956e+05 7.956e+05
  * y         (y) float64 2.05e+06 2.05e+06 2.05e+06 ... 2.048e+06 2.048e+06
  * time      (time) object 't1' 't2' 't3'

phausamann avatar Aug 10 '20 15:08 phausamann

I would like to include this as an example in the docs pretty much verbatim. However, the geowombats package seems to be quite heavy on dependencies and ideally I'd want a dataset where ground truth land use is included. Could you maybe point me to a resource for this kind of data?

phausamann avatar Aug 10 '20 15:08 phausamann

Hey @phausamann thanks for the response to this! In the meantime was was futzing with Featurizer and will propose a change (mmann1123:patch-1) to handle multidimensional samples (quite sure it is working for DataArrays but not quite sure about DataSets (not sure what the output should look like).

I have been using geowombat pretty much exclusively for remotely sensed data. I actually tried to use xarray to create a dummy dataset but in the same format and failed. I will try to contact Jordan about an example for you to work with.

mmann1123 avatar Aug 10 '20 18:08 mmann1123

@phausamann I just pushed a Stackerizer class to (mmann1123:patch-1) please read the comments.

Also, in a related issue, I am finding that I can't use the Stackerizer inside of the pipeline, BUT it works fine if I run it before the pipeline. Just wondering what I'm missing

print(X)
dask.array<concatenate, shape=(3, 42, 285, 371), dtype=float64, chunksize=(1, 1, 128, 128), chunktype=numpy.ndarray>
Coordinates:
  * band      (band) int64 1 2 3 4 5 6 7 8 9 10 ... 34 35 36 37 38 39 40 41 42
  * x         (x) float64 33.01 33.05 33.09 33.13 ... 47.84 47.88 47.92 47.96
  * y         (y) float64 14.9 14.86 14.82 14.78 ... 3.537 3.497 3.456 3.416
  * time      (time) <U1 '1' '2' '3'
    land_use  (time, y, x) <U6 'water' 'water' 'water' ... 'water' 'water'


pl = Pipeline(
    [
        ('stackerizer',Stackerizer(stack_dims = ('x','y','time'), direction='stack')),
        ("sanitizer", Sanitizer(dim='band')),    # Remove elements containing NaNs.
        ("featurizer", Featurizer()),  # Stack all dimensions and variables except for sample dimension.
        ("scaler", wrap(StandardScaler)), # zscores , ?wrap if xarray.self required? 
        ("pca", wrap(PCA, reshapes="feature")), 
        ("cls", wrap(GaussianNB, reshapes="feature")),
    ]
)

etc. 

returns

File "/home/mmann1123/Documents/github/xr_fresh/notebooks/model_experiments.py", line 205, in <module>
   coord="land_use", transform_func=LabelEncoder().fit_transform )(X)
...
ValueError: y should be a 1d array, got an array of shape (3, 285, 371) instead.

Let me know if you want me to push Stackerizer to a different branch.

mmann1123 avatar Aug 11 '20 16:08 mmann1123

I think I know what the issue is: you are creating a Target instance with a LabelEncoder transformer and directly assigning it to a non-2d array.

With estimators wrapped by sklearn-xarray, it is not necessary to assign the target beforehand, the wrapped estimator will automatically take care of this.

So instead of:

y = Target(coord="land_use", transform_func=LabelEncoder().fit_transform)(X)

do this:

y = Target(coord="land_use", transform_func=LabelEncoder().fit_transform)

The assignment (y = y(X))) is performed by the wrapped estimators when necessary, which doesn't happen before the StandardScaler step and at this point, the data is already 2D.

Probably another thing to clarify in the documentation.

phausamann avatar Aug 12 '20 14:08 phausamann

I'll look into it more but if I drop (X) i get:

~/anaconda3/envs/sk-wombat/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    195         if len(x.shape) == 0:
    196             raise TypeError("Singleton array %r cannot be considered"
--> 197                             " a valid collection." % x)
    198         # Check that shape is returning an integer or default to len
    199         # Dask dataframes may not return numeric shape[0] value

TypeError: Singleton array Unassigned sklearn_xarray.Target with coordinate "land_use". cannot be considered a valid collection.

mmann1123 avatar Aug 13 '20 16:08 mmann1123

When you define the target like this:

y = Target(
    coord="land_use",
    transform_func=LabelBinarizer().fit_transform,
    dim="sample",
)

it works.

This tells the Target to reduce multi-dimensional coordinates to the sample dimension, yielding the 1D input that the LabelBinarizer expects. It is weird though that you get an error without that, I'll have to investigate what's going on there.

Here's the pipeline I've used:

pl = Pipeline(
    [
        ("stacker", Stacker(stack_dims=["x", "y", "time"])),
        ("transposer", Transposer(order=("sample", "band"))),
        ("scaler", wrap(StandardScaler)),
        ("pca", wrap(PCA, reshapes="feature")),
        ("cls", wrap(GaussianNB, reshapes="feature")),
    ]
)

pl.fit(X, y)

phausamann avatar Aug 14 '20 08:08 phausamann

Ok great. That is super helpful as always. Much appreciated.

On Fri, Aug 14, 2020 at 4:09 AM Peter Hausamann [email protected] wrote:

When you define the target like this:

y = Target( coord="land_use", transform_func=LabelBinarizer().fit_transform, dim="sample", )

it works.

This tells the Target to reduce multi-dimensional coordinates to the sample dimension, yielding the 1D input that the LabelBinarizer expects. It is weird though that you get an error without that, I'll have to investigate what's going on there.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/phausamann/sklearn-xarray/issues/52#issuecomment-673949091, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHR6VFV6QD2J7YY32E4DFLSATWKNANCNFSM4PX3LFOQ .

mmann1123 avatar Aug 14 '20 13:08 mmann1123