skops icon indicating copy to clipboard operation
skops copied to clipboard

Have labels in config.json for text-classification

Open merveenoyan opened this issue 2 years ago • 10 comments

We need to provide class labels for widget in config.json of text-classification models. See this line in api-inference-community. (I think it's good to keep it regardless of widget, it can be used in inference related utils we might have and Gradio as well.). @skops-dev/maintainers

merveenoyan avatar Aug 29 '22 16:08 merveenoyan

I made that argument previously and was told no :)

BenjaminBossan avatar Aug 29 '22 16:08 BenjaminBossan

@BenjaminBossan I think I'm with you on this one. if people are able to simply grab a model from the Hub, it just doesn't make any sense not to have the class probability mapping.

merveenoyan avatar Aug 29 '22 16:08 merveenoyan

We don't need anything in the config file to do that:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris(as_frame=True)
y = [data.target_names[x] for x in data.target]
X = data.data
X, _, y, _ = train_test_split(X, y)

model = LogisticRegression(solver="liblinear").fit(X, y)

print(model.predict(X[:3]))
print(model.predict_proba(X[:3]))
print(model.classes_)

which gives this output:

['setosa' 'virginica' 'setosa']
[[8.97459211e-01 1.02358882e-01 1.81907728e-04]
 [2.65548039e-05 4.01995225e-01 5.97978220e-01]
 [8.80081786e-01 1.19875993e-01 4.22209155e-05]]
['setosa' 'versicolor' 'virginica']

So we have all the info from the model itself.

adrinjalali avatar Aug 30 '22 11:08 adrinjalali

@adrinjalali This is not always the case, e.g. scikit-learn's twenty news groups dataset has them as numbers.

>>> twenty_train.target
array([1, 1, 3, ..., 2, 2, 2])

Don't know if it's an edge case though but it's good if we prompt user to give labels if that is. (or, for instance if target is label encoded or ordinal encoded, does it still work? I know ordinal encoded things might work for regression but still 😅)

merveenoyan avatar Aug 30 '22 12:08 merveenoyan

if we prompt user to give labels

At least having the option to explicitly pass the labels would be nice IMHO. If they are not passed, default to estimator.classes_ and if those don't exist, fall back to ints.

BenjaminBossan avatar Aug 30 '22 12:08 BenjaminBossan

If you look at my script here, I'm using target_names to get the names. If the model gets numbers, it outputs numbers, if it gets labels, it returns labels. If the model is returning numbers, I really don't think we should try to transform, since we don't know how they're mapped to those strings. And that's why I don't think we need this info in the config file. The model should always be the source of truth when it comes to what the labels are. The post processing of those numbers can only make things go wrong.

adrinjalali avatar Aug 31 '22 07:08 adrinjalali

Other text-classification models on Hub has mapping attributes called id2label and label2id in config, like below:

"id2label": {
--
  | "0": "negative",
  | "1": "neutral",
  | "2": "positive"
  | },
  | "initializer_range": 0.02,
  | "intermediate_size": 3072,
  | "label2id": {
  | "negative": 0,
  | "neutral": 1,
  | "positive": 2
  | },

So you don't really mess things up. The point of having widgets (and later gradio demos) on Hub is for other people to see if the model suits their use case (which can be inferred from inputs and target labels) Hence, I'm still in favor of having the above structure in our config.json.

merveenoyan avatar Aug 31 '22 11:08 merveenoyan

That's probably because in DL people are used to preprocess input before giving the data to the model, and then postprocess the output, but in sklearn the model (or the pipeline) is doing all/most of that work.

adrinjalali avatar Aug 31 '22 12:08 adrinjalali

I'll go with @adrinjalali's suggestion, WDYT @BenjaminBossan?

merveenoyan avatar Sep 06 '22 16:09 merveenoyan

@merveenoyan I think eventually it would be nice to have the possibility for a classifier that outputs ints but whose widget still shows the class names. But it's not high priority for me, so I'm fine with the current solution.

BenjaminBossan avatar Sep 06 '22 16:09 BenjaminBossan

Should we close this? @adrinjalali @BenjaminBossan

merveenoyan avatar Nov 07 '22 16:11 merveenoyan

Yep.

adrinjalali avatar Nov 07 '22 17:11 adrinjalali