mteb
WIP: Multilabel classification
Working on #434. I still have to add a good test task; if anyone has one, don't hesitate to comment.
Maybe this? https://huggingface.co/datasets/coastalcph/multi_eurlex
I'll look into it, thanks @isaac-chung !!
I'm currently in the process of adding EURLEX.
Currently this PR assumes that all labels in the classification are independent from each other.
This is due to MultiOutputClassifier from sklearn, which trains a separate, independent classifier for each label.
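For context, a minimal sketch of that approach, assuming placeholder embeddings and a binary label matrix (this is illustrative, not the PR's actual code):

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical sentence embeddings and a binary label matrix (n_samples, n_labels).
X_train = np.random.rand(100, 384)
Y_train = np.random.randint(0, 2, size=(100, 5))
X_test = np.random.rand(20, 384)

# MultiOutputClassifier fits one independent classifier per label column,
# which is where the independence assumption comes from.
clf = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)  # shape (20, 5)
```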
Some options we could consider that would fix this:
- ClassifierChain, which would be an optimal choice for hierarchical tasks or where the ordering of the labels is trivial. We would have to be careful to order the labels properly though, which might be a pain and a half to do, and I'm not sure whether this should be the specific tasks' or the AbsTask's responsibility.
- Using a neural network like MLPClassifier with multiple outputs. This would be a good option, because it does not need any ordering and does not assume the independence of labels, but it's way slower than just using kNN and we would also lose a great deal of conceptual transparency. (A rough sketch of both options follows below.)
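For illustration, a sketch of the two alternatives; the label count, chain order, and network size here are made up rather than taken from the PR:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.neural_network import MLPClassifier

X = np.random.rand(100, 384)                # placeholder embeddings
Y = np.random.randint(0, 2, size=(100, 5))  # placeholder binary label matrix

# Option 1: ClassifierChain feeds each classifier the previous labels'
# predictions, so the chain order matters.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2, 3, 4])
chain.fit(X, Y)

# Option 2: MLPClassifier accepts a 2D label matrix natively and needs no
# ordering, but is considerably slower to train than kNN.
mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
mlp.fit(X, Y)
```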
What do you guys think @KennethEnevoldsen @imenelydiaker @isaac-chung ?
I'm currently in the process of running MultiEURLEX on my machine, this might take a fair bit :D
My immediate assumption is just to go for simplicity and then we can always expand to other cases in the future.
Regarding the points: do we count MultilabelClassification as a new task for each language contained in EURLEX, or should I only add bonus points for those languages that had no classification task prior to this?
I have been running the task basically all day on UCloud on the two models, it takes a ridiculous amount of time.
@x-tabdeveloping let us discuss how to potentially speed it up on Monday
Yea, I left it running for days and it didn't complete for a single model. I will attempt to make it faster; come by the office if you have ideas @KennethEnevoldsen
Not sure if it helps with speed much, but I noticed that the test/validation sets have 5k samples each. Maybe some downsampling could help.
[update]: When I ran this branch on MultiEurlex, it seemed stuck here for over 5 minutes:
Task: MultiEURLEXMultilabelClassification, split: validation, language: en. Running...
We were discussing downsampling the test set; I found other issues, however, that were affecting performance much more negatively. If performance is still lacking after the changes, I will consider it.
The test fails because of a request timeout; perhaps this signals a more widespread issue?
It runs way faster now, I think I will be ready to submit the results in a couple of hours. I'm trying to write subsampling for the test set in the meantime.
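For reference, a hypothetical way to cap the evaluation split size with the datasets library; the dataset config and the cap of 2048 samples are just examples, not what the PR ends up using:

```python
from datasets import load_dataset

ds = load_dataset("coastalcph/multi_eurlex", "en")
test = ds["test"].shuffle(seed=42)
test_subset = test.select(range(min(2048, len(test))))
```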
@KennethEnevoldsen performance is absolutely crap on the task with both models, especially on accuracy. Where do you think the issue may be?
@x-tabdeveloping my guess is that the KNN (euclidean distance, I assume) is a poor fit for models trained with cosine similarity. A solution is to use a different distance metric for KNN or implement a different classifier. I believe a 2-layer MLP is a good alternative.
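A minimal sketch of the first suggestion, switching the neighbour metric (the parameter values are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# metric="cosine" makes the neighbour search consistent with embeddings tuned
# for cosine similarity, instead of the default euclidean/minkowski distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
```

Normalizing the embeddings to unit length would be another cheap fix, since for unit vectors euclidean and cosine distances give the same nearest neighbours.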
An MLP would also be great for handling collinearity, though it might be slower, and the task already takes quite a while to run (the scikit-learn implementation is also not exactly the fastest).
I can give it a go.
> An MLP would also be great for handling collinearity, though it might be slower, and the task already takes quite a while to run (the scikit-learn implementation is also not exactly the fastest).
What kind of runtime are we talking about? I don't believe the classification should be the limiting factor here, but rather the embedding time.
Running on UCloud again, should be able to submit results within a day.
- It runs very slowly; I couldn't complete the runs. Maybe we should subsample and limit evaluation to the test set instead of both the validation set and the test set.
- Performance is still crap; I have no idea what to do about that, and am at a bit of a loss as to what is happening.
I made the neural network smaller and introduced stratified subsampling for the test set so that it runs faster; I will try to do a rerun.
For what it's worth, it might help with debugging to use a smaller dataset.
Yea using a smaller dataset for test seems like the right approach.
> It runs very slowly; I couldn't complete the runs. Maybe we should subsample and limit evaluation to the test set instead of both the validation set and the test set.
Hmm any idea about what part is slow? Is it simply running the trained model on the test set? (in which case reducing the test set might be an option)
> Performance is still crap; I have no idea what to do about that, and am at a bit of a loss as to what is happening.
Doing a baseline using a logistic regression on each label is probably a good idea
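A sketch of what that baseline could look like, reusing the existing per-label wrapper (illustrative only; the variable names are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# One logistic regression per label, mirroring the independence assumption above.
baseline = MultiOutputClassifier(LogisticRegression(max_iter=1000))
# baseline.fit(X_train_embeddings, Y_train)   # Y_train: binary matrix (n_samples, n_labels)
# Y_pred = baseline.predict(X_test_embeddings)
```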
Something's not right with these scores, I will make a deep dive
I ran EURLEX in English with all-MiniLM-L6 and multiple classifiers (MLPClassifier, KNN, DummyClassifier).
It would seem that the task is simply incredibly hard, and that accuracy is not a particularly good metric to reflect performance; maybe we should make lrap the main score.
Also note that kNN outperforms the MLP by quite a bit; I think this is mainly because the training set is very small and the MLP is quite parameter-rich.
My suggestion is that we roll back to kNN and make LRAP the main score. What do you think @KennethEnevoldsen?
{
"en": {
"dummy": {
"accuracy": 0.0,
"f1": 0.0,
"lrap": 0.17113333333332317
},
"knn": {
"accuracy": 0.0396,
"f1": 0.29540945816583636,
"lrap": 0.4267690714285629
},
"mlp": {
"accuracy": 0.0082,
"f1": 0.08189335124049107,
"lrap": 0.2942032142856986
}
},
"evaluation_time": 270.71
}
Also, including the DummyClassifier scores gives us a relatively good idea of the chance level in this multilabel case.
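For reference, the lrap values above correspond to scikit-learn's label ranking average precision; a minimal sketch of computing it, with placeholder arrays:

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

Y_true = np.array([[1, 0, 0], [0, 0, 1]])                # true binary label matrix
Y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])  # predicted scores per label
lrap = label_ranking_average_precision_score(Y_true, Y_score)
```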
I would not include it in the task, but it might be interesting to just have a "random" model as a baseline.
A couple of thoughts:
- It might be worth increasing the training set size for the MLP
- It might be fine with just KNN; alternatively we can do KNN + MLP and take the best, similar to clf (see the rough sketch after this list)
- It might be worth getting performance scores for subcategories (though in this case, it is 100+ right?)
- I would also like an experiment using the base e5 just to see that larger models actually perform better
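As a rough sketch of the "KNN + MLP and take the best" idea, one could fit both on the embedded training data and keep whichever scores higher on a held-out split. This is purely illustrative: it is not how the existing Classification task picks a classifier, and the hyperparameters are made up.

```python
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def fit_best(X, Y):
    """Fit both candidates and keep whichever has the higher held-out LRAP."""
    X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.2, random_state=42)
    candidates = [
        KNeighborsClassifier(n_neighbors=5, metric="cosine"),
        MLPClassifier(hidden_layer_sizes=(128,), max_iter=300),
    ]
    scored = []
    for clf in candidates:
        clf.fit(X_tr, Y_tr)
        score = label_ranking_average_precision_score(Y_val, clf.predict(X_val))
        scored.append((score, clf))
    return max(scored, key=lambda pair: pair[0])[1]
```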
E5 definitely performs better on the task than paraphrase-multilingual. I'm not sure about the subcategories; it might be a bit too much for some tasks, though we could include them if need be. In my experiments kNN uniformly performs better, even with larger training set sizes. I suppose that with an even larger training set the MLP would surpass kNN, but we're already fighting performance issues with the benchmark, and I think the less we have to embed the better.
Also, specific tasks are free to use whatever they want; if you think an MLP is a better fit, you can specify it in the task. What are your thoughts on the PR right now @KennethEnevoldsen? Should we merge, or is there something that still needs to be addressed?
I believe it is fine to merge