imbalanced-learn
[ENH] Add sample_indices_ for SMOTE/ADASYN classes
SMOTE/ADASYN classes currently do not provide a sample_indices_ attribute, since they generate samples that do not belong to the original dataset.
However, we could define a new semantic for samplers that generate data: sample_indices_ could expose a tuple of the samples used to generate each new point. For samples that are not generated, it would be a single integer.
This would implement a feature requested in issues and on Gitter.
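To make the proposed semantic concrete, here is a minimal NumPy-only sketch. It is not imbalanced-learn code: the function oversample_with_provenance and its parameters are hypothetical, and the interpolation is a simplified SMOTE-like scheme that assumes the minority rows are distinct. It only illustrates what "a single integer for original samples, a tuple of parents for generated ones" could look like.

```python
import numpy as np


def oversample_with_provenance(X, y, minority_label, n_new, k=2, seed=0):
    """Hypothetical helper: SMOTE-like interpolation that records parents."""
    rng = np.random.default_rng(seed)
    # Positions of the minority samples in the ORIGINAL dataset
    minority_idx = np.flatnonzero(y == minority_label)
    X_min = X[minority_idx]
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # k nearest neighbours per row; column 0 is the row itself (distinct rows assumed)
    nn = np.argsort(dists, axis=1)[:, 1:k + 1]

    provenance = list(range(len(X)))  # original samples: a single integer
    X_new = []
    for _ in range(n_new):
        base = int(rng.integers(len(minority_idx)))
        neighbor = int(nn[base, rng.integers(k)])
        step = rng.random()
        X_new.append(X_min[base] + step * (X_min[neighbor] - X_min[base]))
        # generated samples: tuple of the two original indices that were used
        provenance.append((int(minority_idx[base]), int(minority_idx[neighbor])))

    X_res = np.vstack([X, np.asarray(X_new).reshape(-1, X.shape[1])])
    y_res = np.concatenate([y, np.full(n_new, minority_label)])
    return X_res, y_res, provenance


X = np.random.default_rng(42).random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])
X_res, y_res, provenance = oversample_with_provenance(X, y, minority_label=0, n_new=2)
print(provenance)  # eight integers followed by two (base, neighbor) tuples
```

Mixing integers and tuples in one container is exactly the "new semantic" under discussion; a real implementation might prefer a separate attribute with a uniform shape.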
Thinking a bit more about it, and after reading #724, I think we should avoid reusing sample_indices_ with a different semantic. However, we could provide a new attribute with a proper semantic for SMOTE-like samplers.
I was thinking about the same issue because I need the sample indices for GroupKFold CV after oversampling with SMOTE. So I downloaded the repo and made some small local changes to imblearn/over_sampling/_smote/base.py. The code to oversample stays the same:
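import numpy as np
from imblearn.over_sampling import SMOTE as smo

# 8 samples, 3 features; class 0 is the minority (3 samples vs. 5)
X = np.random.random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])

# k_neighbors must be smaller than the minority class size
oversample = smo(k_neighbors=2)
X_, y_ = oversample.fit_resample(X, y)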
With my local changes, calling oversample.sample_indices() returns:
array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])
where the index of each synthetic sample is the same as that of its "mother" real sample.
One can also call oversample.sample_indices(get_which_neighbors=True), which returns a list of tuples indicating which neighbor each synthetic sample was generated from:
[(0, 0),
(1, 0),
(2, 0),
(3, 0),
(4, 0),
(5, 0),
(6, 0),
(7, 0),
(1, 1),
(3, 1)]
For a real sample, the neighbor is 0 (itself). Please let me know if this is also what you have in mind! If you think it is implementable, I can open a new branch.
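For the GroupKFold use case mentioned above, the flat index array is enough to carry group labels over to the resampled data. The sketch below is only an illustration: the groups array is made up, while the index array and the tuple list are copied from the outputs shown in this comment. It assumes the second tuple entry is a neighbor rank rather than a dataset index, as the "0 means itself" convention suggests.

```python
import numpy as np

# Hypothetical group labels for the 8 original samples (not from the thread)
groups = np.array([10, 10, 11, 11, 12, 12, 13, 13])

# Flat "mother" indices as reported above for the 10 resampled rows
mother_indices = np.array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])

# Each resampled row inherits the group of its mother sample
groups_resampled = groups[mother_indices]
print(groups_resampled)

# The tuple form carries the same mother index in its first entry
pairs = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
         (5, 0), (6, 0), (7, 0), (1, 1), (3, 1)]
mothers_from_pairs = np.array([mother for mother, _ in pairs])
```

groups_resampled could then be passed as the groups argument of GroupKFold().split(X_, y_, groups_resampled). One caveat: a synthetic point is interpolated between its mother and a neighbor, and the neighbor may belong to a different group, so inheriting only the mother's group does not fully rule out leakage across folds.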
Hi! Thanks for creating this issue. I think this feature can be useful to understand datasets we are working with.
Thinking a bit more about it, and after reading #724, I think we should avoid reusing sample_indices_ with a different semantic. However, we could provide a new attribute with a proper semantic for SMOTE-like samplers.
@glemaitre, IMO the semantic should be given by the owners of the datasets. Taking the example of #724: if we oversample the data and use sample_indices_ as a tuple of the samples used to generate each new point, we would expect people to generate new points (i.e., new people).
WDYT?
Hi, is this issue still open? I see there was a PR, but it seems outdated.
Since it seems that no one is currently working on it, I will do it.