
Sample selection bias and up/down-sampling

Open rth opened this issue 6 years ago • 5 comments

It's a bit of an open-ended question. In my understanding up/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed e.g. by Zadrozny 2004.

In the use case of imbalanced-learn I gather this is not an issue, because the sample selection only happens depending on the target variable y, not on any of the features in X (which corresponds to case 2 on page 2 of the above-linked paper)?

An orthogonal question: assume we do have a dataset with sample selection bias based on some feature in X (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of X does not match the real-world distribution and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?

rth avatar Feb 05 '19 22:02 rth

I would say yes. Then, we would need to think about the right module to do that.

In other words, the distribution of one of the columns of X does not match the real world distribution and we would like to compensate for it.

I have not looked at the paper yet, but is this related to importance sampling, where you would sample the X column such that it follows a given "real-world" distribution?

In the case of over-sampling, we could think about something similar, in which you estimate the distribution (or parameters such as covariances) from other datasets and use this in the rebalancing procedure. It would be a kind of data augmentation using knowledge from data instead of random generation.
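A rough sketch of that over-sampling idea (plain numpy; the external reference dataset and the Gaussian model are my illustrative assumptions, not an existing imbalanced-learn feature): estimate the mean and covariance of the minority class from reference data, then draw synthetic minority samples from the fitted distribution instead of randomly duplicating the few known rows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Imbalanced training set: 500 majority vs 50 minority samples (2 features).
X_maj = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))
X_min = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))

# External "reference" dataset assumed to describe the minority class well.
X_ref = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(2000, 2))

# Estimate distribution parameters (mean and covariance) from the reference data.
mu = X_ref.mean(axis=0)
cov = np.cov(X_ref, rowvar=False)

# Over-sample the minority class by drawing synthetic points from the
# fitted Gaussian, rather than duplicating the 50 known rows at random.
n_new = len(X_maj) - len(X_min)
X_synth = rng.multivariate_normal(mu, cov, size=n_new)

X_min_balanced = np.vstack([X_min, X_synth])
print(X_min_balanced.shape)  # → (500, 2): as many minority as majority samples
```

Compared with random over-sampling, the synthetic points here carry information from the reference dataset rather than just repeating the observed minority rows.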

I would be really interested in implementing something like this, or helping with it.

glemaitre avatar Feb 06 '19 18:02 glemaitre

We should include some of these in 1.X

glemaitre avatar Nov 17 '19 11:11 glemaitre

@rth Did you see some of these methods in the literature? Probably we should look at the fairness papers.

glemaitre avatar Nov 17 '19 11:11 glemaitre

I have not really looked into this question since opening the issue in February.

rth avatar Nov 17 '19 11:11 rth

Well,

Probably we should look at the fairness papers.

Yes. There is a body of research on this subject. I think this problem is itself one of imbalance, so we could tackle it inside imbalanced-learn. API-wise we may need some changes. I leave here one (of the many) relevant papers.

chkoar avatar Nov 19 '20 13:11 chkoar