MalteseNewsClassification added

Open dokato opened this issue 10 months ago • 4 comments

Maltese news classification dataset: https://huggingface.co/datasets/MLRS/maltese_news_categories

As suggested here: https://github.com/embeddings-benchmark/mteb/issues/419

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
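
To illustrate the subsampling item in the checklist: this thread does not show how mteb's self.stratified_subsampling() works internally, so the following is only a minimal standalone sketch of the general idea for single-label data. The function name stratified_subsample, its parameters, and the default seed are assumptions made for illustration and are not the mteb API; the 2048 default simply mirrors the threshold mentioned in the checklist.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples=2048, seed=42):
    """Return sorted indices of a subsample that keeps label proportions.

    Hypothetical helper for illustration only; not part of mteb.
    """
    total = len(labels)
    if n_samples >= total:
        # Nothing to cut: keep every example.
        return list(range(total))

    rng = random.Random(seed)

    # Group example indices by their (single) label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Each label gets a share proportional to its frequency, rounded down.
    quotas = {
        label: (len(idxs) * n_samples) // total
        for label, idxs in by_label.items()
    }

    # Hand out the slots lost to rounding, largest remainder first.
    leftover = n_samples - sum(quotas.values())
    by_remainder = sorted(
        by_label,
        key=lambda label: (len(by_label[label]) * n_samples) % total,
        reverse=True,
    )
    for label in by_remainder[:leftover]:
        quotas[label] += 1

    # Draw the required number of examples from each label group.
    chosen = []
    for label, idxs in by_label.items():
        chosen.extend(rng.sample(idxs, quotas[label]))
    return sorted(chosen)
```

This sketch assumes exactly one label per example. With several labels per example, the per-label groups overlap, so the proportional quotas above no longer apply directly.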

Apr 24 '24 10:04 dokato

Yup, that makes sense, thanks @isaac-chung, I'll keep an eye on this PR!

Apr 24 '24 12:04 dokato

It's merged :)

May 15 '24 20:05 isaac-chung

Thanks @isaac-chung, yeah, I saw. But as this is quite big, we're still trying to figure out how to do stratified sampling for a multilabel problem, see: https://github.com/embeddings-benchmark/mteb/discussions/698. I proposed my take on that here; until I get a green light or an alternative, let's keep it WIP.

May 15 '24 21:05 dokato

@dokato Let's keep this in a separate PR so we can discuss the two things on different threads (this task and then stratified subsampling). It should also be fine considering that a) all of these models will be rerun anyway and b) we already have a function that does this. It may be faulty, but the interface already exists and we can swap it out anytime.

May 16 '24 07:05 x-tabdeveloping

K, done that! Created new PR in https://github.com/embeddings-benchmark/mteb/pull/760

This should be ready @isaac-chung. I "unwiped" it.

May 17 '24 18:05 dokato