Have labels in config.json for text-classification
We need to provide class labels for the widget in config.json of text-classification models. See this line in api-inference-community. (I think it's good to keep this regardless of the widget; it can be used in inference-related utils we might have, and in Gradio as well.) @skops-dev/maintainers
I made that argument previously and was told no :)
@BenjaminBossan I think I'm with you on this one. If people are able to simply grab a model from the Hub, it just doesn't make any sense not to have the class probability mapping.
We don't need anything in the config file to do that:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris(as_frame=True)
y = [data.target_names[x] for x in data.target]
X = data.data
X, _, y, _ = train_test_split(X, y)

model = LogisticRegression(solver="liblinear").fit(X, y)
print(model.predict(X[:3]))
print(model.predict_proba(X[:3]))
print(model.classes_)
```
which gives this output:
```
['setosa' 'virginica' 'setosa']
[[8.97459211e-01 1.02358882e-01 1.81907728e-04]
 [2.65548039e-05 4.01995225e-01 5.97978220e-01]
 [8.80081786e-01 1.19875993e-01 4.22209155e-05]]
['setosa' 'versicolor' 'virginica']
```
So we have all the info from the model itself.
@adrinjalali This is not always the case, e.g. scikit-learn's twenty newsgroups dataset has them as numbers:
```python
>>> twenty_train.target
array([1, 1, 3, ..., 2, 2, 2])
```
I don't know if it's an edge case, but it would be good to prompt the user to give labels in that case. (Or, for instance, if the target is label-encoded or ordinal-encoded, does it still work? I know ordinal-encoded targets might work for regression, but still 😅)
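To make the label-encoding case concrete, here is a minimal sketch (not from any skops code) showing that with a `LabelEncoder` the fitted estimator's `classes_` are plain integers and the original names only survive in the encoder, so the model alone cannot recover them:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

data = load_iris(as_frame=True)
y_names = [data.target_names[i] for i in data.target]

# the estimator only ever sees the encoded integers
le = LabelEncoder()
y_encoded = le.fit_transform(y_names)

model = LogisticRegression(solver="liblinear").fit(data.data, y_encoded)
print(model.classes_)   # [0 1 2] -- integers, no names
print(le.classes_)      # ['setosa' 'versicolor' 'virginica'] -- names live only in the encoder
print(le.inverse_transform(model.predict(data.data[:3])))  # mapping back needs the encoder
```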
> if we prompt user to give labels
At least having the option to explicitly pass the labels would be nice IMHO. If they are not passed, default to `estimator.classes_`, and if that doesn't exist, fall back to ints.
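A minimal sketch of that fallback order; the helper name is hypothetical and only illustrates the suggestion, it is not an existing skops API:

```python
def get_class_labels(estimator, labels=None):
    # Hypothetical helper illustrating the suggested fallback order:
    # 1. labels explicitly passed by the user win,
    # 2. otherwise use estimator.classes_ if the estimator exposes it,
    # 3. otherwise return None and let the caller show the raw integer outputs.
    if labels is not None:
        return list(labels)
    if hasattr(estimator, "classes_"):
        return list(estimator.classes_)
    return None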
If you look at my script here, I'm using `target_names` to get the names. If the model is trained on numbers, it outputs numbers; if it is trained on labels, it returns labels. If the model returns numbers, I really don't think we should try to transform them, since we don't know how they map to those strings. And that's why I don't think we need this info in the config file. The model should always be the source of truth when it comes to what the labels are. Post-processing those numbers can only make things go wrong.
Other text-classification models on the Hub have mapping attributes called `id2label` and `label2id` in their config, like below:
"id2label": {
--
| "0": "negative",
| "1": "neutral",
| "2": "positive"
| },
| "initializer_range": 0.02,
| "intermediate_size": 3072,
| "label2id": {
| "negative": 0,
| "neutral": 1,
| "positive": 2
| },
So you don't really mess things up.
The point of having widgets (and later Gradio demos) on the Hub is for other people to see if the model suits their use case (which can be inferred from the inputs and target labels). Hence, I'm still in favor of having the above structure in our config.json.
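As a rough sketch of what writing that structure could look like on the skops side (the helper below is hypothetical and not an existing skops API; only the `id2label`/`label2id` keys mirror the config shown above):

```python
import json

def write_label_mapping(model, config_path="config.json"):
    # Hypothetical helper: derive the mapping from the fitted estimator
    # and store it in the model's existing config.json.
    id2label = {str(i): str(label) for i, label in enumerate(model.classes_)}
    label2id = {str(label): i for i, label in enumerate(model.classes_)}

    with open(config_path) as f:
        config = json.load(f)
    config["id2label"] = id2label
    config["label2id"] = label2id
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```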
That's probably because in DL people are used to preprocessing the input before giving the data to the model, and then postprocessing the output, but in sklearn the model (or the pipeline) does all or most of that work.
I'll go with @adrinjalali's suggestion, WDYT @BenjaminBossan?
@merveenoyan I think eventually it would be nice to have the possibility for a classifier that outputs ints but whose widget still shows the class names. But it's not high priority for me, so I'm fine with the current solution.
Should we close this? @adrinjalali @BenjaminBossan
Yep.