TransformedTargetClassifier
Describe the workflow you want to enable
I would like to be able to use a LabelEncoder as a wrapper around a classifier, similar to what can be achieved with preprocessors on the y value for regressors via TransformedTargetRegressor.
Describe your proposed solution
Add a class TransformedTargetClassifier that accepts both a transformer on y and a classifier.
Describe alternatives you've considered, if relevant
An alternative would be to use the voting classifier with a single estimator, but that appears to be misusing that class.
Additional context
I'm proposing this feature because in Auto-sklearn we use the LabelEncoder on a call to fit to have all classifiers we try use a simple, encoded representation. When using the Auto-sklearn classes, we can undo the transformations ourselves. However, if the user would like to access an individual model, there's no way we can wrap the LabelEncoder around these models.
CC @eddiebergman
A bit of a workaround: we are not limiting the regressor to be a regressor. The following should indeed work:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import LabelEncoder
# Use a classifier as the "regressor" and a LabelEncoder as the target transformer
tt = TransformedTargetRegressor(regressor=LogisticRegression(),
                                transformer=LabelEncoder())
X, y = make_classification()
tt.fit(X, y)
print(tt.score(X, y))
> LabelEncoder on a call to fit to have all classifiers we try use a simple, encoded representation.
I think that most of the scikit-learn estimators already do that internally and call inverse_transform at predict. This is also linked with the classes_ fitted attribute.
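A quick illustration of that built-in behaviour: a classifier fitted directly on string labels already stores them in classes_ and returns them from predict.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris with the string species names as the target
iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)        # ['setosa' 'versicolor' 'virginica']
print(clf.predict(X[:3]))  # predictions come back as the original strings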
Thanks @glemaitre for that suggestion. Indeed, it gets us 90% of where we would like to be, but I see two small issues:
- it is still called Regressor and is also a RegressorMixin
- it doesn't support predict_proba
As far as I can see, it would be rather simple to extract all the relevant code into a parent class and provide both a regressor and a classifier. Please let me know if you'd be interested in that.
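To make the idea concrete, here is a rough sketch (the class name and details are hypothetical, not an existing scikit-learn API) of what such a wrapper could do: encode y with the transformer in fit, inverse-transform the predictions in predict, and pass predict_proba through unchanged.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

class SketchTransformedTargetClassifier(BaseEstimator, ClassifierMixin):
    # Hypothetical sketch, not the scikit-learn implementation
    def __init__(self, classifier=None, transformer=None):
        self.classifier = classifier
        self.transformer = transformer

    def fit(self, X, y):
        self.transformer_ = clone(self.transformer)
        y_encoded = self.transformer_.fit_transform(y)
        self.classifier_ = clone(self.classifier).fit(X, y_encoded)
        self.classes_ = self.transformer_.classes_  # original labels
        return self

    def predict(self, X):
        # map encoded predictions back to the original labels
        return self.transformer_.inverse_transform(self.classifier_.predict(X))

    def predict_proba(self, X):
        # probability columns already correspond to classes_ (sorted labels)
        return self.classifier_.predict_proba(X)

X, y_num = make_classification(random_state=0)
y = np.array(["cat", "dog"])[y_num]  # string targets
clf = SketchTransformedTargetClassifier(LogisticRegression(), LabelEncoder())
clf.fit(X, y)
print(clf.predict(X)[:5])         # original string labels
print(clf.predict_proba(X[:2]))   # class probabilities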
Regarding what the estimators do by themselves, we're aware of that, but to keep Auto-sklearn as simple as possible we aim to only handle integer classes internally. So one could say that the Auto-sklearn estimator aims to do that internally, and we just need to find a way to provide the user with the internal models in case they want to see them.
For a library, I really like the idea of separating out the logic for encoding the target and having models that only support integers internally. As for inclusion, I think there needs to be a use case besides encoding the target, because sklearn classifiers handle the encoding internally. @mfeurer Do you see another use case for TransformedTargetClassifier besides encoding the target?
> So one could say that the Auto-sklearn estimator aims to do that internally, and we just need to find a way to provide the user with the internal models in case they want to see them.
@mfeurer I'm wondering, do you want to provide an interface to output all the internally implemented models to users? If so, I can try to do it.
From discussion https://github.com/scikit-learn/scikit-learn/discussions/22171, if the target is a string, the workaround of using the existing TransformedTargetRegressor won't work. Would this be considered a different use case?
This became a problem because xgboost removed the label transform from its sklearn interface in 1.6.0. See their release note (https://github.com/dmlc/xgboost/blob/master/NEWS.md#v160-2022-apr-16) and the relevant PR (https://github.com/dmlc/xgboost/pull/7357).
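A minimal illustration of that limitation (exact behaviour may depend on the scikit-learn version): fit validates y as numeric before the transformer is applied, so the workaround above breaks down as soon as the target contains strings.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X, y_num = make_classification(random_state=0)
y = np.where(y_num == 0, "cat", "dog")  # string class labels

tt = TransformedTargetRegressor(regressor=LogisticRegression(),
                                transformer=LabelEncoder())
tt.fit(X, y)  # raises: y is validated as numeric before the LabelEncoder ever runs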
For our use case we want to be able to show predictions as either the original text labels or the probability of a given label. PyCaret has a TransformedTargetClassifier implementation that can return the original text labels, but it doesn't implement predict_proba, so it only gives us half of what we need. It would be great if scikit-learn had a full implementation of TransformedTargetClassifier. Here's an example using PyCaret's implementation:
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from pycaret.internal.preprocess.target.TransformedTargetClassifier import TransformedTargetClassifier
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'species']
df = pd.read_csv(csv_url, names=col_names)
y = df.pop('species')
X = df
pipeline = make_pipeline(TransformedTargetClassifier(classifier=CategoricalNB(), transformer=LabelEncoder()))
pipeline.fit(X, y)
predictions = pipeline.predict(X)[[10, 25, 50]]
print(predictions)
# ['Iris-setosa' 'Iris-setosa' 'Iris-versicolor']
@marcdhansen raises an excellent point!
The inability to seamlessly integrate target label preprocessing into scikit-learn's classification pipelines is a notable limitation. As I work on projects with XGBoost and utilize the scikit-learn pipeline structure for end-to-end processing, I've encountered challenges in achieving a streamlined workflow. It would be highly beneficial if scikit-learn could consider implementing a feature that allows for convenient target label preprocessing within classification pipelines, making the framework even more versatile and user-friendly.
Additionally, it's worth noting that the proposed alternative, TransformedTargetRegressor, doesn't support predict_proba, which further underscores the need for a dedicated solution for classification pipelines.
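A quick check of that point, reusing the earlier workaround (a sketch, assuming recent scikit-learn behaviour):
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X, y = make_classification(random_state=0)
tt = TransformedTargetRegressor(regressor=LogisticRegression(),
                                transformer=LabelEncoder()).fit(X, y)
print(hasattr(tt, "predict_proba"))  # False: the wrapper only exposes predict and score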
> As I work on projects with XGBoost and utilize the scikit-learn pipeline structure for end-to-end processing, I've encountered challenges in achieving a streamlined workflow
But why shouldn't this be fixed in XGBoost?
On a more general note, I do think that the suggested TransformedTargetClassifier would be a good addition to scikit-learn.
Sorry to jump in, but I have been following discussions around target transformations and I would gladly take on the task of adding a TransformedTargetClassifier to sklearn.
@glemaitre @GaelVaroquaux Would that be fine with you if I open a PR?
/take