Multiple label annotated datasets

Open asparius opened this issue 10 months ago • 8 comments

Currently, mmteb has two datasets, a Turkish multidomain product review dataset and a Kurdish sentiment dataset, that contain categorical annotations such as a domain tag in addition to sentiment labels. I am wondering whether we should create a new dataset that includes these categorical annotations. I believe it could be beneficial for low-resource languages. Waiting for your opinions :)

asparius avatar Apr 19 '24 21:04 asparius

If you're talking about Multilabel classification tasks, I think this PR https://github.com/embeddings-benchmark/mteb/pull/440 may be interesting 🙂

imenelydiaker avatar Apr 20 '24 17:04 imenelydiaker

It is different from multilabel classification; perhaps I phrased it incorrectly. They are datasets that have both Sentiment and also category labels for the given samples. I have only uploaded their sentiment part. My question is whether I should add the category classification task as well.

asparius avatar Apr 23 '24 13:04 asparius

I guess you can create a new task using the same dataset and only change the label column? But since it's the same dataset and text, I don't know if changing the classes is relevant to the benchmark.

imenelydiaker avatar Apr 23 '24 15:04 imenelydiaker

They are datasets that have both Sentiment and also category labels for the given samples

That seems to me to be multilabel classification (as opposed to multiclass).

KennethEnevoldsen avatar Apr 23 '24 16:04 KennethEnevoldsen

I think you can frame it as a multilabel task, since the dataset offers 2 columns that can be used as labels. It's just that in a multilabel setting you'll try to predict both classes at the same time, no?

imenelydiaker avatar Apr 23 '24 16:04 imenelydiaker

It depends on the classifier used. A softmax classifier (equivalent to a one-layer MLP: embedding size --> n labels x n label types, e.g. 256 --> 2x3, assuming three binary label categories) would be the same as 3 independent softmax classifiers. However, if you add one more layer to that MLP, the label types are no longer independent, because they share the hidden layer.
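To make the one-layer case concrete, here is a minimal pure-Python sketch. The sizes (`EMB_DIM`, `N_TYPES`, `N_CLASSES`) and the random weights are made up for illustration, and the softmax is assumed to be applied separately within each label type. It only checks the forward pass: the joint layer's per-type probabilities are the same as those of separate classifiers built from the corresponding weight rows. The training argument would follow from the loss being a sum of per-type cross-entropies, but that is not shown here.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a short list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    # weights is a list of rows; each row has len(x) entries (no bias, for brevity).
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


random.seed(0)
# Toy sizes (assumption); the example in the thread is 256 -> 2x3.
EMB_DIM, N_TYPES, N_CLASSES = 8, 3, 2

x = [random.gauss(0, 1) for _ in range(EMB_DIM)]
# One joint layer with N_TYPES * N_CLASSES output rows.
W = [[random.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(N_TYPES * N_CLASSES)]

# Joint model: one matrix product, then a softmax per label type.
joint_logits = linear(W, x)
joint_probs = [
    softmax(joint_logits[t * N_CLASSES:(t + 1) * N_CLASSES]) for t in range(N_TYPES)
]

# Independent models: each label type only uses its own block of rows.
independent_probs = [
    softmax(linear(W[t * N_CLASSES:(t + 1) * N_CLASSES], x)) for t in range(N_TYPES)
]

matches = all(
    math.isclose(a, b)
    for jp, ip in zip(joint_probs, independent_probs)
    for a, b in zip(jp, ip)
)
print(matches)
```

With a hidden layer in between, the rows of the output layer would all read from the same hidden representation, so this block-wise split would no longer hold.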

From the PR (which is a WIP so it might change) it seems like they use:

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html

which states

This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.

So essentially independent classifiers.
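As a rough sketch of what "one classifier per target" looks like in practice (this is not the PR's code; the random embeddings, the two label columns, and the choice of `LogisticRegression` as the base estimator are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 16))  # stand-in for sentence embeddings

# Two label columns per sample: hypothetical sentiment (2 classes)
# and domain (3 classes) ids, built deterministically for the demo.
sentiment = np.arange(40) % 2
domain = np.arange(40) % 3
Y = np.column_stack([sentiment, domain])

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(X)
print(pred.shape)            # (40, 2)
print(len(clf.estimators_))  # 2, one fitted classifier per label column
```

Each entry of `clf.estimators_` is fitted on its own column of `Y` only, so nothing is shared between the sentiment and domain predictions.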

KennethEnevoldsen avatar Apr 23 '24 16:04 KennethEnevoldsen

For this a ClassifierChain seems more appropriate as the labels are clearly dependent on each other. Chime in on the discussion at #440. I'm thinking of adding multiple options to the task so that independence does not need to be assumed but we have to discuss how this is best executed.
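For reference, a minimal sketch of scikit-learn's `ClassifierChain` on toy data. The data, the two binary labels (a sentiment flag and a made-up `is_electronics` domain flag), and the `LogisticRegression` base estimator are all assumptions for illustration; how this would fit into the task in #440 is still open.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))  # stand-in for sentence embeddings

# Hypothetical binary labels; the second one depends on the first.
sentiment = (X[:, 0] > 0).astype(int)
is_electronics = ((X[:, 1] > 0) ^ (sentiment == 1)).astype(int)
Y = np.column_stack([sentiment, is_electronics])

# order=[0, 1]: predict sentiment first, then feed it to the second classifier.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1])
chain.fit(X, Y)

pred = chain.predict(X)
print(pred.shape)  # (60, 2)
# The second classifier sees the 16 features plus the first label.
print(chain.estimators_[1].coef_.shape)  # (1, 17)
```

The extra input column on the second estimator is what lets the chain model the dependence between labels, at the cost of having to pick a label order.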

x-tabdeveloping avatar Apr 24 '24 11:04 x-tabdeveloping

Ah right, then you might want to add that in as well. By the way, KNN seems to support multilabel natively.
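A quick sketch of what that looks like with scikit-learn's `KNeighborsClassifier`, which accepts a 2D `y` directly (the toy embeddings and the two label columns are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 16))  # stand-in for sentence embeddings

# Hypothetical sentiment (2 classes) and domain (3 classes) columns.
Y = np.column_stack([np.arange(40) % 2, np.arange(40) % 3])

# No wrapper needed: y with shape (n_samples, n_outputs) is accepted as-is.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, Y)

knn_pred = knn.predict(X)
print(knn_pred.shape)  # (40, 2)
```

Both outputs are voted on using the same set of neighbours, but each column is still decided separately.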

KennethEnevoldsen avatar Apr 24 '24 12:04 KennethEnevoldsen

Feel free to reopen if it's still relevant.

isaac-chung avatar Aug 15 '24 08:08 isaac-chung