
WIP: Multilabel classification

Open x-tabdeveloping opened this issue 10 months ago • 19 comments

Working on #434. I still have to add a good test task; if anyone has one, don't hesitate to comment.

x-tabdeveloping avatar Apr 19 '24 09:04 x-tabdeveloping

Maybe this? https://huggingface.co/datasets/coastalcph/multi_eurlex

isaac-chung avatar Apr 19 '24 10:04 isaac-chung

I'll look into it, thanks @isaac-chung !!

x-tabdeveloping avatar Apr 19 '24 10:04 x-tabdeveloping

I'm currently in the process of adding EURLEX.

x-tabdeveloping avatar Apr 23 '24 14:04 x-tabdeveloping

Currently this PR assumes that all labels in the classification are independent of each other. This is due to MultiOutputClassifier from sklearn, which trains a separate, independent classifier for each label.
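
For reference, a minimal sketch of what that looks like (variable names and data below are made-up placeholders, not taken from the PR):

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# X_* stand in for sentence embeddings, y_* for binary indicator matrices
# of shape (n_samples, n_labels); random placeholders for illustration.
X_train = np.random.rand(32, 384)
y_train = np.random.randint(0, 2, size=(32, 5))
X_test = np.random.rand(8, 384)

# One independent binary kNN classifier is fitted per label column,
# so any correlation between labels is ignored.
clf = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # shape (8, 5), one column per label
```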

Some options we could consider that would fix this:

  1. ClassifierChain, which would be an optimal choice for hierarchical tasks or tasks where the labels have an obvious ordering. We would have to be careful to order the labels properly though, which might be a pain and a half, and I'm not sure whether that should be the specific task's or the AbsTask's responsibility.
  2. Using a neural network like MLPClassifier with multiple outputs. This would be a good option because it needs no ordering and does not assume the independence of labels, but it's way slower than just using kNN, and we would also lose a great deal of conceptual transparency. (A rough sketch of both options follows this list.)
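
Roughly, the two alternatives would look something like this (sketch only; the hyperparameters are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.neural_network import MLPClassifier

# Option 1: each classifier in the chain also sees the predictions for the
# previous labels, so the order of the label columns matters.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=None)  # order=None keeps the given column order

# Option 2: a single MLP trained on the full indicator matrix; no ordering
# assumption, labels are predicted jointly, but training is much slower.
mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)

# Both expose the same fit/predict interface as MultiOutputClassifier above.
```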

What do you guys think @KennethEnevoldsen @imenelydiaker @isaac-chung ?

x-tabdeveloping avatar Apr 24 '24 11:04 x-tabdeveloping

I'm currently in the process of running MultiEURLEX on my machine; this might take a fair bit :D

x-tabdeveloping avatar Apr 24 '24 11:04 x-tabdeveloping

My immediate assumption is just to go for simplicity and then we can always expand to other cases in the future.

KennethEnevoldsen avatar Apr 24 '24 11:04 KennethEnevoldsen

Regarding the points: do we count MultilabelClassification as a new task for each language contained in EURLEX, or should I only add bonus points for the languages that had no classification task prior to this?

x-tabdeveloping avatar Apr 24 '24 12:04 x-tabdeveloping

I have been running the task basically all day on UCloud on the two models; it takes a ridiculous amount of time.

x-tabdeveloping avatar Apr 25 '24 13:04 x-tabdeveloping

@x-tabdeveloping let us discuss how to potentially speed it up on Monday.

KennethEnevoldsen avatar Apr 26 '24 21:04 KennethEnevoldsen

Yeah, I left it running for days and it didn't complete for a single model. I will attempt to make it faster; come by the office if you have ideas @KennethEnevoldsen.

x-tabdeveloping avatar Apr 29 '24 07:04 x-tabdeveloping

Not sure if it helps with speed much, but I noticed that the test/validation sets have 5k samples each. Maybe some downsampling could help.

[update]: When I ran this branch on MultiEurlex, it seemed stuck here for over 5 minutes:

Task: MultiEURLEXMultilabelClassification, split: validation, language: en. Running...

We were discussing downsampling the test set, but I found other issues that were affecting performance far more negatively. If performance is still lacking after the changes, I will consider it.

x-tabdeveloping avatar Apr 29 '24 08:04 x-tabdeveloping

The test fails because of a request timeout; perhaps this signals a more widespread issue?

x-tabdeveloping avatar Apr 29 '24 08:04 x-tabdeveloping

It runs way faster now; I think I will be ready to submit the results in a couple of hours. I'm trying to write subsampling for the test set in the meantime.

x-tabdeveloping avatar Apr 29 '24 09:04 x-tabdeveloping

@KennethEnevoldsen performance is absolutely crap on the task with both models, especially on accuracy. Where do you think the issue may be?

x-tabdeveloping avatar May 03 '24 09:05 x-tabdeveloping

@x-tabdeveloping my guess is that the KNN (Euclidean distance, I assume) is a poor fit for models trained for cosine similarity. A solution is to use a different distance metric for KNN or to implement a different classifier. I believe a 2-layer MLP is a good alternative.
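
For illustration, switching the metric is a one-line change in scikit-learn (sketch only, not claiming this is what the PR should hard-code):

```python
from sklearn.neighbors import KNeighborsClassifier

# metric="cosine" forces the brute-force neighbour search, which is fine at this scale
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric="cosine")
```

Note that for L2-normalised embeddings Euclidean and cosine distance give the same neighbour ranking, so this mainly matters for models whose embeddings are not normalised.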

KennethEnevoldsen avatar May 03 '24 09:05 KennethEnevoldsen

An MLP would also be great for handling label collinearity, though it might be slower, and the task already takes quite a while to run (the scikit-learn implementation is not exactly the fastest either).

x-tabdeveloping avatar May 03 '24 09:05 x-tabdeveloping

I can give it a go.

x-tabdeveloping avatar May 03 '24 09:05 x-tabdeveloping

> An MLP would also be great for handling label collinearity, though it might be slower, and the task already takes quite a while to run (the scikit-learn implementation is not exactly the fastest either).

What kind of runtimes are we talking about? I don't believe the classification should be limiting us here, but rather the embedding time.

KennethEnevoldsen avatar May 03 '24 10:05 KennethEnevoldsen

Running on UCloud again, should be able to submit results within a day.

x-tabdeveloping avatar May 03 '24 12:05 x-tabdeveloping

  1. It runs very slowly; I couldn't complete the runs. Maybe we should subsample and limit evaluation to the test set, instead of running both the validation set and the test set.
  2. Performance is still crap; I have no idea what to do about that and am at a bit of a loss as to what is happening.

x-tabdeveloping avatar May 08 '24 12:05 x-tabdeveloping

I made the neural network smaller and introduced stratified subsampling for the test set so that it runs faster; I will try to do a rerun.
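
For reference, one way to sketch label-aware subsampling (an illustration of the idea, not necessarily what this branch implements): bucket each document by its rarest positive label and sample the buckets proportionally, so infrequent labels survive the cut.

```python
import numpy as np

def subsample_multilabel(y, n_keep, seed=42):
    """Pick ~n_keep row indices from a binary label matrix y of shape
    (n_samples, n_labels), roughly preserving each sample's rarest label."""
    rng = np.random.default_rng(seed)
    label_freq = y.sum(axis=0)
    # bucket each sample by its least frequent positive label (-1 if it has none)
    rarest = np.array([
        row.nonzero()[0][np.argmin(label_freq[row.nonzero()[0]])] if row.any() else -1
        for row in y
    ])
    keep = []
    for bucket in np.unique(rarest):
        members = np.where(rarest == bucket)[0]
        quota = max(1, round(len(members) / len(y) * n_keep))
        keep.extend(rng.choice(members, size=min(quota, len(members)), replace=False))
    return np.array(keep)  # total can drift slightly from n_keep; fine for a sketch
```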

x-tabdeveloping avatar May 08 '24 13:05 x-tabdeveloping

For what it's worth, it might help to debug with a small dataset.

isaac-chung avatar May 08 '24 14:05 isaac-chung

Yeah, using a smaller dataset for the test set seems like the right approach.

> It runs very slowly; I couldn't complete the runs. Maybe we should subsample and limit evaluation to the test set, instead of running both the validation set and the test set.

Hmm, any idea which part is slow? Is it simply running the trained model on the test set? (In which case reducing the test set might be an option.)

> Performance is still crap; I have no idea what to do about that and am at a bit of a loss as to what is happening.

Doing a baseline using a logistic regression on each label is probably a good idea
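
Something like this, presumably (placeholder data; in practice X would be the precomputed embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

X = np.random.rand(64, 384)                 # sentence embeddings
y = np.random.randint(0, 2, size=(64, 10))  # binary label matrix
baseline = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
```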

KennethEnevoldsen avatar May 08 '24 19:05 KennethEnevoldsen

Something's not right with these scores; I will do a deep dive.

x-tabdeveloping avatar May 09 '24 09:05 x-tabdeveloping

I ran EURLEX in English with all-MiniLM-L6 with multiple classifiers (MLPClassifier, kNN, DummyClassifier). It would seem that the task is simply incredibly hard, and that accuracy is not exactly a good metric to reflect performance; maybe we should make LRAP the main score. Also note that kNN outperforms MLP by quite a bit; I think this is mainly because the training set is very small and the MLP is quite parameter-rich.

My suggestion is that we roll back to kNN and make LRAP the main score. What do you think, @KennethEnevoldsen?

{
    "en": {
      "dummy": {
        "accuracy": 0.0,
        "f1": 0.0,
        "lrap": 0.17113333333332317
      },
      "knn": {
        "accuracy": 0.0396,
        "f1": 0.29540945816583636,
        "lrap": 0.4267690714285629
      },
      "mlp": {
        "accuracy": 0.0082,
        "f1": 0.08189335124049107,
        "lrap": 0.2942032142856986
      }
    },
    "evaluation_time": 270.71
}
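
For context on why the numbers diverge so much: if the accuracy here is sklearn's accuracy_score on the indicator matrix, it is exact-match (subset) accuracy, which only counts a document as correct when every single label matches, whereas LRAP rewards ranking the true labels near the top. A toy example with made-up values:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, label_ranking_average_precision_score

y_true  = np.array([[1, 0, 1], [0, 1, 1]])
y_pred  = np.array([[1, 0, 1], [0, 1, 0]])              # hard 0/1 predictions
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])  # per-label scores

accuracy_score(y_true, y_pred)                           # 0.5: the second doc misses one label
f1_score(y_true, y_pred, average="macro")                # ~0.89: per-label F1, averaged
label_ranking_average_precision_score(y_true, y_score)   # 1.0: true labels outrank the false one in both docs
```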

x-tabdeveloping avatar May 09 '24 13:05 x-tabdeveloping

Also, including the DummyClassifier scores gives us a relatively good idea of the chance level in this multilabel case.

x-tabdeveloping avatar May 09 '24 13:05 x-tabdeveloping

I would not include it in the task, but it might be interesting to just have a "random" model as a baseline.

A couple of thoughts:

  • It might be worth increasing the training set size for the MLP.
    • It might also be fine with just KNN; alternatively we can do KNN + MLP and take the best (similar to clf).
  • It might be worth getting performance scores for subcategories (though in this case it is 100+, right?)
  • I would also like an experiment using the base e5, just to see that larger models actually perform better (a rough invocation sketch below).
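
Roughly like this (the checkpoint name and output folder are assumptions for "base e5"; the task name is the one added in this PR):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed checkpoint for "base e5"
evaluation = mteb.MTEB(tasks=["MultiEURLEXMultilabelClassification"])
evaluation.run(model, output_folder="results/multilingual-e5-base")
```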

KennethEnevoldsen avatar May 09 '24 17:05 KennethEnevoldsen

E5 definitely performs better on the task than paraphrase-multilingual. I'm not sure about the subcategory scores; that might be a bit too much for some tasks, though we could include them if need be. In my experiments kNN uniformly performs better than the MLP, even with larger training set sizes. I suppose that with an even larger training set the MLP would surpass kNN, but we're already fighting performance issues with the benchmark, and I think the less we have to embed the better.

x-tabdeveloping avatar May 09 '24 18:05 x-tabdeveloping

Also, specific tasks are free to use whatever classifier they want; if you think an MLP is a better fit, you can specify it in the task. What are your thoughts on the PR right now, @KennethEnevoldsen? Should we merge, or is there something that still needs to be addressed?

x-tabdeveloping avatar May 10 '24 10:05 x-tabdeveloping

I believe it is fine to merge

KennethEnevoldsen avatar May 11 '24 10:05 KennethEnevoldsen