TransformedTargetClassifier
Describe the workflow you want to enable
I would like to be able to use a LabelEncoder as a wrapper around a classifier, similar to what can be achieved with preprocessors on the y value for regressors via TransformedTargetRegressor.
Describe your proposed solution
Add a class TransformedTargetClassifier that accepts both a transformer on y and a classifier.
Describe alternatives you've considered, if relevant
An alternative would be to use the voting classifier with a single estimator, but that appears to be misusing that class.
Additional context
I'm proposing this feature because in Auto-sklearn we use the LabelEncoder on a call to fit to have all classifiers we try use a simple, encoded representation. When using the Auto-sklearn classes, we can undo the transformations ourselves. However, if the user would like to access an individual model, there's no way we can wrap the LabelEncoder around these models.
CC @eddiebergman
A bit of a workaround: we are not limiting the regressor to be a regressor. The following should indeed work:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import LabelEncoder
# Use a classifier as the "regressor" and a LabelEncoder as the target transformer
tt = TransformedTargetRegressor(regressor=LogisticRegression(),
                                transformer=LabelEncoder())
X, y = make_classification()
tt.fit(X, y)
print(tt.score(X, y))
> LabelEncoder on a call to fit to have all classifiers we try use a simple, encoded representation.
I think that most of the scikit-learn estimators already do that internally and call inverse_transform at predict. This is also linked with the classes_ fitted attribute.
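A quick illustration of that built-in behaviour: a classifier fitted directly on string labels already stores them in classes_ and returns them from predict.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris with the string species names as the target
iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)        # ['setosa' 'versicolor' 'virginica']
print(clf.predict(X[:3]))  # predictions come back as the original strings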
Thanks @glemaitre for that suggestion. Indeed, it gets us 90% of where we would like to be, but I see two small issues:
- it is still called Regressor and is also a RegressorMixin
- it doesn't support predict_proba
As far as I can see, it would be rather simple to extract all the relevant code into a parent class and provide both a regressor and a classifier. Please let me know if you'd be interested in that.
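To make the idea concrete, here is a rough sketch (the class name and details are hypothetical, not an existing scikit-learn API) of what such a wrapper could do: encode y with the transformer in fit, inverse-transform the predictions in predict, and pass predict_proba through unchanged.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

class SketchTransformedTargetClassifier(BaseEstimator, ClassifierMixin):
    # Hypothetical sketch, not the scikit-learn implementation
    def __init__(self, classifier=None, transformer=None):
        self.classifier = classifier
        self.transformer = transformer

    def fit(self, X, y):
        self.transformer_ = clone(self.transformer)
        y_encoded = self.transformer_.fit_transform(y)
        self.classifier_ = clone(self.classifier).fit(X, y_encoded)
        self.classes_ = self.transformer_.classes_  # original labels
        return self

    def predict(self, X):
        # map encoded predictions back to the original labels
        return self.transformer_.inverse_transform(self.classifier_.predict(X))

    def predict_proba(self, X):
        # probability columns already correspond to classes_ (sorted labels)
        return self.classifier_.predict_proba(X)

X, y_num = make_classification(random_state=0)
y = np.array(["cat", "dog"])[y_num]  # string targets
clf = SketchTransformedTargetClassifier(LogisticRegression(), LabelEncoder())
clf.fit(X, y)
print(clf.predict(X)[:5])         # original string labels
print(clf.predict_proba(X[:2]))   # class probabilities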
Regarding what the estimators do by themselves, we're aware of that, but to keep Auto-sklearn as simple as possible we aim to only handle integer classes internally. So one could say that the Auto-sklearn estimator aims to do that internally, and we just need to find a way to provide the user with the internal models in case they want to see them.
For a library, I really like the idea of separating out the logic for encoding the target and having models that only support integers internally. As for inclusion, I think there needs to be a use case besides encoding the target, because sklearn classifiers handle the encoding internally. @mfeurer Do you see another use case for TransformedTargetClassifier besides encoding the target?
> So one could say that the Auto-sklearn estimator aims to do that internally, and we just need to find a way to provide the user with the internal models in case they want to see them.
@mfeurer I'm wondering, do you want to provide an interface to output all the internally implemented models to users? If so, I can try to do it.
From discussion https://github.com/scikit-learn/scikit-learn/discussions/22171, if the target is a string, the workaround of using the existing TransformedTargetRegressor won't work. Would this be considered a different use case?
This became a problem because xgboost removed the label transform from its sklearn interface in 1.6.0. See their release note (https://github.com/dmlc/xgboost/blob/master/NEWS.md#v160-2022-apr-16) and the relevant PR (https://github.com/dmlc/xgboost/pull/7357).
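A minimal illustration of that limitation (exact behaviour may depend on the scikit-learn version): fit validates y as numeric before the transformer is applied, so the workaround above breaks down as soon as the target contains strings.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X, y_num = make_classification(random_state=0)
y = np.where(y_num == 0, "cat", "dog")  # string class labels

tt = TransformedTargetRegressor(regressor=LogisticRegression(),
                                transformer=LabelEncoder())
tt.fit(X, y)  # raises: y is validated as numeric before the LabelEncoder ever runs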
For our use case we want to be able to show predictions as either the original text labels or the probability of a given label. PyCaret has a TransformedTargetClassifier implementation that can return the original text labels, but it doesn't implement predict_proba, so it only gives us half of what we need. It would be great if scikit-learn had a full implementation of TransformedTargetClassifier. Here's an example using PyCaret's implementation:
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from pycaret.internal.preprocess.target.TransformedTargetClassifier import TransformedTargetClassifier
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'species']
df = pd.read_csv(csv_url, names=col_names)
y = df.pop('species')
X = df
pipeline = make_pipeline(TransformedTargetClassifier(classifier=CategoricalNB(), transformer=LabelEncoder()))
pipeline.fit(X, y)
predictions = pipeline.predict(X)[[10, 25, 50]]
print(predictions)
# ['Iris-setosa' 'Iris-setosa' 'Iris-versicolor']
@marcdhansen raises an excellent point!
The inability to seamlessly integrate target label preprocessing into scikit-learn's classification pipelines is a notable limitation. As I work on projects with XGBoost and utilize the scikit-learn pipeline structure for end-to-end processing, I've encountered challenges in achieving a streamlined workflow. It would be highly beneficial if scikit-learn could consider implementing a feature that allows for convenient target label preprocessing within classification pipelines, making the framework even more versatile and user-friendly.
Additionally, it's worth noting that the proposed alternative, TransformedTargetRegressor, doesn't support predict_proba, which further underscores the need for a dedicated solution for classification pipelines.
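A quick check of that point, reusing the earlier workaround (a sketch, assuming recent scikit-learn behaviour):
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X, y = make_classification(random_state=0)
tt = TransformedTargetRegressor(regressor=LogisticRegression(),
                                transformer=LabelEncoder()).fit(X, y)
print(hasattr(tt, "predict_proba"))  # False: the wrapper only exposes predict and score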
> As I work on projects with XGBoost and utilize the scikit-learn pipeline structure for end-to-end processing, I've encountered challenges in achieving a streamlined workflow
But why shouldn't this be fixed in XGBoost?
On a more general note, I do think that the suggested TransformedTargetClassifier would be a good addition to scikit-learn.
Sorry to jump in, but I have been following discussions around target transformations and I would gladly take on the task of adding a TransformedTargetClassifier to sklearn.
@glemaitre @GaelVaroquaux Would that be fine with you if I open a PR?
/take