
[Question] Does sklearn's clustering algorithm have a semi-supervised function?

Open wulaoshi opened this issue 3 years ago • 4 comments

For example, when using KMeans, can we input some data that already has labels to assist the clustering, or to initialize the cluster_centers_? Thanks, and I look forward to your reply.

wulaoshi avatar Dec 05 '22 08:12 wulaoshi

Hi @wulaoshi,

We only use the KMeansClassifier and KMeansRegressor from sklearn, which you can find here and here. So we do pass the labels to fit, if that is what you mean. If not, maybe you could elaborate with a small code snippet?

eddiebergman avatar Dec 05 '22 09:12 eddiebergman

Hi @eddiebergman, thanks for your reply. I may not have described my question clearly. The pseudocode looks like this:

  from sklearn.model_selection import train_test_split
  from sklearn import datasets

  digits = datasets.load_digits()
  X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5)
  X_train0, X_train1, y_train0, _ = train_test_split(X_train, y_train, test_size=0.95)

  # SemiKMeans is hypothetical: fit on the labeled subset plus the unlabeled X_train1
  km = SemiKMeans(n_clusters=20)
  km.fit(X_train0, y_train0, X_train1)

We have labels for some of the data, i.e., X_train0 and y_train0. y_train0 looks like [3 9 6 8 1 4 6 7 6 9 5 9 0 0 0 9 4 3 3 2 2 5 2 9 2 7 3 4 4 8 2 3 2 2 8 3 9 ...], i.e., the category each row of X_train0 belongs to. In this way, we hope to make the clustering results more in line with expectations. Thanks.
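[Editor's note: `SemiKMeans` above is hypothetical, but one way to approximate the idea with plain sklearn is to seed KMeans' initial centers from the labeled class means via the `init` parameter, then fit on labeled and unlabeled rows together. A minimal sketch of that setup:]

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0
)
X_train0, X_train1, y_train0, _ = train_test_split(
    X_train, y_train, test_size=0.95, random_state=0
)

# One initial center per class seen in the labeled subset, placed at the
# mean of that class's labeled points.
classes = np.unique(y_train0)
centers = np.vstack([X_train0[y_train0 == c].mean(axis=0) for c in classes])

# n_init=1: the initialization is deterministic, so a single run suffices.
km = KMeans(n_clusters=len(classes), init=centers, n_init=1)
km.fit(np.vstack([X_train0, X_train1]))  # unlabeled rows join the fit
```

This only steers the initialization; KMeans may still move points across classes during its iterations, and it assumes every class appears at least once in the labeled subset.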

wulaoshi avatar Dec 08 '22 03:12 wulaoshi

So I'm not really sure what X_train1 represents in fit(X0, y0, X1), but I assume it's part of the typical semi-supervised setup. The answer then is likely no, we don't support it; one reason is its incompatibility with the other parts of the pipeline, which rely on the classic fit(X, y).

Another issue is how we would handle resampling strategies so that they respect both cross-validation and holdout. That issue is more a limit of my own knowledge of typical semi-supervised learning code, and is likely solvable.

However, to address the first point, it's likely possible to do that splitting internally in your component itself:

class MySSLKMeans(AutoSklearnClassifier):
    def __init__(
        self,
        n_clusters: int | None = None,
        internal_splitsize: float | None = 0.95,
    ):
        self.n_clusters = n_clusters
        self.internal_splitsize = internal_splitsize

    def fit(self, X, y) -> MySSLKMeans:
        # Re-create the labeled/unlabeled split inside the classic fit(X, y)
        X_train0, X_train1, y_train0, _ = train_test_split(X, y, test_size=self.internal_splitsize)
        self.estimator = SemiKMeans(n_clusters=self.n_clusters)
        self.estimator.fit(X_train0, y_train0, X_train1)
        return self

eddiebergman avatar Dec 08 '22 12:12 eddiebergman

Thank you, I understand. I'll write some code along those lines to solve my problem.

wulaoshi avatar Dec 09 '22 07:12 wulaoshi