
Samplers / pipelines for imbalanced datasets

Open TomAugspurger opened this issue 6 years ago • 16 comments

Imbalanced datasets, where the classes have very different occurrence rates, can show up in large datasets.

There are many strategies for dealing with imbalanced data. http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html implements a number of them, some of which could be scaled to large datasets with dask.
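For illustration, one of the simplest such strategies is random undersampling of the majority classes. A toy pure-Python sketch of the idea (not imbalanced-learn's actual `RandomUnderSampler`, just the concept it implements):

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop rows from larger classes until every class
    matches the minority-class count. A toy sketch only."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    keep = []
    for label, idx in by_class.items():
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2            # imbalanced: 8 vs 2
Xr, yr = random_undersample(X, y)
# each class now has 2 rows, so len(Xr) == 4
```

Random under/over-sampling like this is embarrassingly parallel per class, which is what makes it a plausible candidate for scaling with dask.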

TomAugspurger avatar Jul 27 '18 01:07 TomAugspurger

Hi, I think that most of the change would be to introduce support for fit_resample and fit_sample in the fit_transform method.
I'll be happy to assist on this issue.

sephib avatar Feb 27 '20 17:02 sephib

@sephib do you have any examples of fit_resample and fit_sample? I'm not familiar with them.

TomAugspurger avatar Mar 02 '20 14:03 TomAugspurger

The core fit_resample function is defined in imblearn/base.py (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6b3c5ae/imblearn/base.py#L54).
It is used throughout the imblearn library - for example, here is the implementation within the imblearn pipeline: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6b3c5aed61f2e5dc0e8af87d97ea92b95dcafdd0/imblearn/pipeline.py#L333

sephib avatar Mar 03 '20 09:03 sephib

Thanks. The standard sklearn.pipeline.Pipeline works well with dask containers. Does the one in imblearn work with Dask objects? If not, what breaks?


TomAugspurger avatar Mar 03 '20 12:03 TomAugspurger

Currently, when I run dask-ml with an imblearn pipeline, I get an error:

AttributeError: 'FunctionSampler' object has no attribute 'transform'

This comes from the fit_transform function in dask_ml/model_selection/methods.py, which looks for a fit_transform attribute, or for fit and transform attributes (which in imblearn are "converted" to fit_resample).
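To make the failure mode concrete, here is a rough pure-Python sketch of that attribute dispatch (a simplification for illustration, not the actual dask_ml/model_selection/methods.py code), with the extra fit_resample branch this issue proposes:

```python
class FunctionSampler:
    """Stand-in for an imblearn sampler: it exposes fit_resample
    but has neither fit_transform nor transform."""
    def fit_resample(self, X, y):
        return X, y

def fit_transform(est, X, y):
    # Simplified dispatch: try fit_transform, then fit + transform.
    # Without the final fit_resample branch, samplers raise the
    # AttributeError shown above.
    if hasattr(est, "fit_transform"):
        return est.fit_transform(X, y)
    if hasattr(est, "fit") and hasattr(est, "transform"):
        return est.fit(X, y).transform(X)
    if hasattr(est, "fit_resample"):
        return est.fit_resample(X, y)
    raise AttributeError(
        f"{type(est).__name__!r} object has no attribute 'transform'"
    )

Xt = fit_transform(FunctionSampler(), [[1], [2]], [0, 1])
# falls through to the fit_resample branch and returns (X, y)
```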

sephib avatar Mar 04 '20 07:03 sephib

It would help to have a minimal example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

TomAugspurger avatar Mar 04 '20 12:03 TomAugspurger

Hi, here is a sample code that passes dask_ml/model_selection/methods.py. Unfortunately it still does not pass the /imblearn/base.py file, but I think it may be something with the example.

when amending the file with

from imblearn.pipeline import Pipeline

instead of

from sklearn.pipeline import Pipeline

and adding these lines into the fit_transform function after line 260:

elif hasattr(est, "fit_resample"):
                Xt = est.fit_resample(X, y, **fit_params)

from sklearn.model_selection import train_test_split as tts
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
import dask_ml.model_selection as dcv
from sklearn.model_selection import GridSearchCV

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1.25, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=5000, random_state=10)

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformers and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)
param_grid = {"pca__n_components": [1, 2, 3]}

# grid = GridSearchCV(pipeline, param_grid=param_grid)
grid = dcv.GridSearchCV(pipeline, param_grid=param_grid)

grid.fit(X_train, y_train)

Any inputs would be appreciated

sephib avatar Mar 09 '20 08:03 sephib

Thanks. So the issue is with dask_ml.model_selection.GridSearchCV? I'm confused about how this would work with scikit-learn, since (AFAIK) fit_resample isn't part of their API.

TomAugspurger avatar Mar 09 '20 15:03 TomAugspurger

That's the magic of imblearn.pipeline (if you comment out the dcv.GridSearchCV line and un-comment the sklearn GridSearchCV line, the code runs without any errors).

sephib avatar Mar 09 '20 21:03 sephib

I don't really see how that would work. But feel free to propose changes in a PR and we can discuss that there.


TomAugspurger avatar Mar 10 '20 13:03 TomAugspurger

@TomAugspurger

I started a POC to adapt our RandomUnderSampler to natively support dask arrays and dataframes (in/out): https://github.com/scikit-learn-contrib/imbalanced-learn/pull/777

I think that we can do something similar for both RandomOverSampler and ClusterCentroids. They don't rely on kNN, which makes it possible for them to work in a distributed setting. The other methods rely on kNN, and I am not sure it would be easy to do anything there.

Regarding the integration with the imbalanced-learn Pipeline, our implementation is exactly the one of scikit-learn, but we check whether a sampler is within the pipeline. This check looks for the fit_resample attribute, which is applied only during fit of the pipeline. Thus, I would say that you can safely use imblearn.Pipeline as a replacement for sklearn.Pipeline.

I was wondering if you would have a bit of time just to check that, on the dask side, we haven't implemented something stupid in the above PR (I am not yet super familiar with distributed computation).
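The sampler check described above can be sketched in a few lines of toy Python (this is an illustration of the idea only, not the real imblearn.pipeline.Pipeline): steps with fit_resample run during fit and are skipped entirely at predict time.

```python
class MiniPipeline:
    """Toy sketch of how imblearn's Pipeline treats samplers.
    steps is a list of (name, estimator) pairs, last one a classifier."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for name, step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)   # resample at fit time
            else:
                X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for name, step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                continue                          # samplers skipped at predict
            X = step.transform(X)
        return self.steps[-1][1].predict(X)

# Hypothetical toy steps to exercise the behaviour:
class DropFirst:
    def fit_resample(self, X, y):
        return X[1:], y[1:]

class Memorize:
    def fit(self, X, y):
        self.n_fit = len(X)
        return self
    def predict(self, X):
        return [0] * len(X)

pipe = MiniPipeline([("sampler", DropFirst()), ("clf", Memorize())])
pipe.fit([[1], [2], [3]], [0, 0, 1])
preds = pipe.predict([[1], [2], [3]])
# the classifier saw 2 rows at fit time, but predict covers all 3
```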

glemaitre avatar Nov 06 '20 10:11 glemaitre

Regarding the integration with the imbalanced-learn Pipeline, our implementation is exactly the one of scikit-learn but we check if a sampler is within the pipeline. This check looks for the attribute fit_resample which would be applied only during fit of the pipeline. Thus, I would say that you can safely use imblearn.Pipeline in replacement of the sklearn.Pipeline.

@TomAugspurger is a PR still relevant? If so, I'll be happy to get some guidance.

sephib avatar Nov 06 '20 13:11 sephib

I'm not sure what's required, but perhaps imbalanced-learn's Pipeline will just be able to accept Dask collections after that pull request? I don't know what estimators like GridSearchCV need to do (if anything) to work with imbalanced-learn pipelines.


TomAugspurger avatar Nov 06 '20 14:11 TomAugspurger

I guess we can see how @glemaitre's PR goes and then see if there is anything else to do on the dask-ml side.

sephib avatar Nov 07 '20 21:11 sephib

Does imblearn support Dask natively? I have been using joblib with parallel_backend="dask", but it seems that it is not able to parallelize my tasks.

vishalvvs avatar Sep 01 '22 09:09 vishalvvs

Any updates on this? For example, could I use RandomOverSampler if I used @glemaitre's PR with minor changes? Thank you in advance!

Jose-Bastos avatar Apr 09 '24 18:04 Jose-Bastos