Multiple label annotated datasets
Currently, mmteb has two datasets, a Turkish multi-domain product review dataset and a Kurdish sentiment dataset, that contain categorical annotations such as domain tags in addition to sentiment labels. I am wondering whether we should create new tasks that include these categorical annotations. I believe this could be beneficial for low-resource languages. Waiting for your opinions :)
If you're talking about Multilabel classification tasks, I think this PR https://github.com/embeddings-benchmark/mteb/pull/440 may be interesting 🙂
It is different from multilabel classification; perhaps I phrased it incorrectly. These are datasets that have both sentiment labels and category labels for the given samples. I have only uploaded their sentiment part; my question is whether I should also add the category classification task as well.
I guess you could create a new task using the same dataset and only change the label column? But since it's the same dataset and text, I don't know whether changing the classes adds anything relevant to the benchmark.
They are datasets that have both Sentiment and also category labels for the given samples
That seems to me to be multilabel classification (as opposed to multiclass).
I think you can frame it as a multilabel task, since the dataset offers two columns that can be used as labels. It's just that in a multilabel setting you'll try to predict both classes at the same time, no?
It depends on the classifier used. A softmax classifier (equivalent to a one-layer MLP: embedding size → n labels × n label types, e.g. 256 → 2×3, assuming three binary label categories) would be the same as three independent softmax classifiers. However, if you add one more layer to that MLP, the label types do influence each other.
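The equivalence claimed above can be checked numerically: with a single linear layer, the weight columns belonging to one label type never interact with those of another, so applying a softmax per label group gives the same probabilities as training separate heads. A minimal NumPy sketch (the shapes 256, 3 and 2 are just the illustrative numbers from the comment, not anything from the benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 256))  # 4 samples, 256-dim embeddings

# Joint head: one weight matrix producing 3 label types x 2 classes each
W = rng.normal(size=(256, 3 * 2))
logits = (emb @ W).reshape(4, 3, 2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

joint = softmax(logits)  # softmax applied per label type

# Independent heads: the same weights split into 3 per-type classifiers
independent = np.stack(
    [softmax(emb @ W[:, 2 * i : 2 * i + 2]) for i in range(3)], axis=1
)

print(np.allclose(joint, independent))  # the two formulations agree
```

With a hidden layer in between, the shared intermediate representation couples the label types, and this equivalence no longer holds.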
From the PR (which is a WIP, so it might change) it seems like they use:
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html
which states
This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.
So essentially independent classifiers.
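For reference, this is what the `MultiOutputClassifier` strategy looks like: one base classifier is cloned and fit per target column, with no information flowing between targets. The data below is random stand-in data; the two label columns (binary "sentiment", three-way "category") are hypothetical placeholders for the dataset's actual columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))        # stand-in for sentence embeddings
Y = np.column_stack([
    rng.integers(0, 2, 100),          # hypothetical binary sentiment label
    rng.integers(0, 3, 100),          # hypothetical 3-way category label
])

# Fits one independent LogisticRegression per target column
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)                 # shape (100, 2): one column per target
```

Note that `clf.estimators_` contains one fitted classifier per column, which is exactly the independence assumption being discussed.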
For this, a ClassifierChain seems more appropriate, as the labels are clearly dependent on each other. Chime in on the discussion at #440. I'm thinking of adding multiple options to the task so that independence does not need to be assumed, but we have to discuss how this is best executed.
Ah right, then you might want to add that in as well. Btw, KNN seems to support multilabel natively.
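Indeed, `KNeighborsClassifier` accepts a 2D multilabel indicator matrix directly, predicting all labels from the same neighbor set without wrapping it in a meta-estimator. A quick sketch on random stand-in data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # stand-in for sentence embeddings
Y = rng.integers(0, 2, size=(100, 2))   # multilabel indicator matrix

# No MultiOutputClassifier wrapper needed: fit on the 2D label matrix
knn = KNeighborsClassifier(n_neighbors=5).fit(X, Y)
pred = knn.predict(X)                   # shape (100, 2), all labels at once
```

Since the same neighbors vote on every label, KNN implicitly captures label co-occurrence in the neighborhood, which sits somewhere between the independent and chained formulations above.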
Feel free to reopen if it's still relevant.