imbalanced-learn

[ENH] Add sample_indices_ for SMOTE/ADASYN classes

Open glemaitre opened this issue 4 years ago • 5 comments

SMOTE/ADASYN classes currently do not provide a sample_indices_ attribute because they generate samples that do not belong to the original dataset.

However, we could define new semantics for samplers that generate data. sample_indices_ could expose, for each generated point, a tuple of the indices of the samples used to generate it. For samples that are not generated, the entry would be a single integer.
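To make the proposed semantics concrete, here is a minimal sketch of a SMOTE-style interpolation that records provenance for every row. The helper name `smote_like_with_provenance` and its parameters are hypothetical and written only for illustration; this is not imbalanced-learn's implementation, and the real attribute might be shaped differently.

```python
import numpy as np


def smote_like_with_provenance(X, minority_idx, n_new, k=2, seed=0):
    """Toy SMOTE-style oversampling that records where each row came from.

    Hypothetical helper for illustration only; assumes k < len(minority_idx).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    minority_idx = np.asarray(minority_idx)
    X_min = X[minority_idx]

    # Pairwise distances inside the minority class; a sample is never its own neighbor.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # positions within X_min

    # Original rows map to a single integer: their own index.
    provenance = list(range(len(X)))
    new_rows = []
    for _ in range(n_new):
        base = rng.integers(len(X_min))
        nn = neighbors[base, rng.integers(k)]
        gap = rng.random()
        new_rows.append(X_min[base] + gap * (X_min[nn] - X_min[base]))
        # Generated rows map to the pair of original indices they interpolate between.
        provenance.append((int(minority_idx[base]), int(minority_idx[nn])))
    return np.vstack([X, *new_rows]), provenance


X = np.random.default_rng(42).random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])
X_res, provenance = smote_like_with_provenance(X, np.flatnonzero(y == 0), n_new=2)
```

With this layout, `provenance[:8]` holds plain integers and the two trailing entries are tuples of original indices, which is one way to read "a tuple for generated points, a single integer otherwise".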

This would implement a feature requested in issues and gitter.

Nov 02 '20 10:11 glemaitre

After thinking a bit more about it and reading #724, I think that we should avoid reusing sample_indices_ with different semantics. However, we could provide a new attribute with proper semantics for SMOTE-like samplers.

Feb 15 '21 23:02 glemaitre

I was thinking about the same issue because I need the sample indices for GroupKFold CV after oversampling with SMOTE. So I downloaded the repo and made some small local changes to imblearn/over_sampling/_smote/base.py. The code to oversample stays the same:

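import numpy as np
from imblearn.over_sampling import SMOTE as smo

# 8 samples, 3 features; class 0 is the minority (3 samples vs. 5)
X = np.random.random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])

# k_neighbors can be at most 2 here: the minority class has only 3 samples
oversample = smo(k_neighbors=2)
X_, y_ = oversample.fit_resample(X, y)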

With my local changes, calling oversample.sample_indices() returns:

array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])

where the index of each synthetic sample is the same as that of its "mother" real sample.
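As a rough sketch of how such an index array could feed GroupKFold, the group label of each resampled row can be looked up from its "mother" sample. The `groups` values below are made up for illustration, and `sample_indices` is simply copied from the output above rather than produced by any released imbalanced-learn API.

```python
import numpy as np

# Index array copied from the output above: originals first, then synthetic
# rows labelled with the index of their "mother" sample.
sample_indices = np.array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])

# Hypothetical group labels for the 8 original samples.
groups = np.array(["a", "a", "b", "b", "c", "c", "d", "d"])

# Each resampled row inherits the group of the original sample it points to.
groups_resampled = groups[sample_indices]

# Rows beyond the original length are the synthetic ones.
is_synthetic = np.arange(len(sample_indices)) >= len(groups)
```

The resulting `groups_resampled` could then be passed as the `groups` argument of scikit-learn's `GroupKFold().split(X_, y_, groups=groups_resampled)`. One caveat: a synthetic point also depends on a neighbor, which may belong to a different group than the mother sample, so inheriting only the mother's group may not rule out leakage completely.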

One can also call oversample.sample_indices(get_which_neighbors=True), which returns a list of tuples indicating which neighbor the synthetic sample was generated from:

[(0, 0),
 (1, 0),
 (2, 0),
 (3, 0),
 (4, 0),
 (5, 0),
 (6, 0),
 (7, 0),
 (1, 1),
 (3, 1)]

For a real sample, the neighbor is 0 (itself). Please let me know if this is also what you have in mind! If you think it is implementable, I can open a new branch.

Attachment: base.txt
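For what it's worth, a downstream consumer could split the tuple output back into its parts along these lines. This is only a sketch: it assumes the second entry is 0 for real samples and a non-zero neighbor rank for synthetic ones, as in the output above.

```python
# Tuple output copied from the example above: (mother index, neighbor rank).
pairs = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
         (5, 0), (6, 0), (7, 0), (1, 1), (3, 1)]

# The first entry of each tuple reproduces the plain index array.
mothers = [mother for mother, _ in pairs]

# Assumption: a non-zero rank marks a synthetic row.
synthetic_positions = [pos for pos, (_, rank) in enumerate(pairs) if rank != 0]
```

Here `mothers` matches the array returned without `get_which_neighbors`, and `synthetic_positions` picks out the two rows appended by the oversampler.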

Apr 01 '21 09:04 tianlinhe

Hi! Thanks for creating this issue. I think this feature can be useful for understanding the datasets we are working with.

> After thinking a bit more about it and reading #724, I think that we should avoid reusing sample_indices_ with different semantics. However, we could provide a new attribute with proper semantics for SMOTE-like samplers.

@glemaitre, IMO, the semantics should be given by the owners of the datasets. Taking the example of #724: if we oversample the data and use sample_indices_ as a tuple of the samples used to generate each new point, we would expect people generating new points (i.e., new people).

WDYT?

May 17 '21 19:05 nhm-7

Hi, is this issue still open? I see there was a PR, but it seems outdated.

Oct 04 '22 06:10 JurajSlivka

> Hi, is this issue still open? I see there was a PR, but it seems outdated.

Since it seems that no one is currently working on it, I will do it.

Oct 09 '22 14:10 JurajSlivka