auto-sklearn
[Question] Does sklearn's clustering algorithm have a semi-supervised function?
For example, when using KMeans, can I pass in some data that already has labels to assist the clustering, or use those labels to initialize cluster_centers_?
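For concreteness, a rough sketch of the second idea using only plain scikit-learn, whose KMeans accepts an explicit (n_clusters, n_features) array for init; the helper name here is just illustrative, not an existing API:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_seeded_by_labels(X_labeled, y_labeled, X_all):
    # One initial center per known class: the mean of its labeled points.
    centers = np.stack([X_labeled[y_labeled == c].mean(axis=0)
                        for c in np.unique(y_labeled)])
    # init accepts an array of centers; n_init=1 keeps exactly these seeds.
    km = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1)
    return km.fit(X_all)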
Thanks, and I look forward to your reply.
Hi @wulaoshi,
We only use the KMeansClassifier and KMeansRegressor from sklearn, which you can find here and here. So we do pass the labels to fit, if that is what you mean? If you don't mean that, maybe you could elaborate with a small code snippet?
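For context, the standard supervised interface looks like this (a minimal sketch; the time budget is just an illustrative setting):

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True))
automl = AutoSklearnClassifier(time_left_for_this_task=60)
automl.fit(X_train, y_train)  # labels go straight into fit(X, y)
print(accuracy_score(y_test, automl.predict(X_test)))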
Hi eddiebergman, thanks for your reply. I may not have described my question very clearly. The pseudocode looks like this:
from sklearn.model_selection import train_test_split
from sklearn import datasets

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5)
# Keep only 5% of the training labels; the rest is treated as unlabeled.
X_train0, X_train1, y_train0, _ = train_test_split(X_train, y_train, test_size=0.95)

# SemiKMeans is a hypothetical semi-supervised KMeans, not an existing sklearn class.
km = SemiKMeans(n_clusters=20)
km.fit(X_train0, y_train0, X_train1)
We have labeled some of the data, i.e., X_train0 and y_train0. y_train0 looks like [3 9 6 8 1 4 6 7 6 9 5 9 0 0 0 9 4 3 3 2 2 5 2 9 2 7 3 4 4 8 2 3 2 2 8 3 9 ...], i.e. the label or category to which each row of X_train0 belongs.
In this way, we hope to make the clustering results more in line with expectations.
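As a side note (not something auto-sklearn itself uses), scikit-learn's semi_supervised module expresses exactly this kind of split through the classic fit(X, y) signature by marking unlabeled samples with y == -1; a rough sketch reusing the arrays above:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.vstack([X_train0, X_train1])
y = np.concatenate([y_train0, -np.ones(len(X_train1), dtype=int)])  # -1 = unlabeled
lp = LabelPropagation().fit(X, y)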
Thanks.
So I'm not really sure what X_train1 represents in fit(X0, y0, X1), but I assume this is part of the typical setup of semi-supervised learning. The answer then is likely no, we don't support it; one of the reasons is its incompatibility with the other parts of the pipeline, which rely on the classic fit(X, y).
Another issue would be how to apply resampling strategies to this so that they respect both cross-validation and holdout. The issue here is more with my own knowledge of typical semi-supervised learning code, and is likely solvable.
However, to address the first point, it's likely possible to just do that splitting internally in your component itself:
from __future__ import annotations

from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

class MySSLKMeans(AutoSklearnClassifier):
    def __init__(
        self,
        n_clusters: int | None = None,
        internal_splitsize: float | None = 0.95,
    ):
        self.n_clusters = n_clusters
        self.internal_splitsize = internal_splitsize

    def fit(self, X, y) -> MySSLKMeans:
        # Split internally so the component keeps the classic fit(X, y) interface.
        X_train0, X_train1, y_train0, _ = train_test_split(
            X, y, test_size=self.internal_splitsize
        )
        # SemiKMeans is the hypothetical semi-supervised KMeans from above.
        self.estimator = SemiKMeans(n_clusters=self.n_clusters)
        self.estimator.fit(X_train0, y_train0, X_train1)
        return self
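Hypothetical usage of the wrapper, assuming a SemiKMeans implementation is available:

ssl_km = MySSLKMeans(n_clusters=20, internal_splitsize=0.95)
ssl_km.fit(X_train, y_train)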
Thank you, I understand. I'll write some code along these lines to solve my problem.