Converted PLSC to hierarchical

Open x-tabdeveloping opened this issue 9 months ago • 3 comments

Checklist for adding MMTEB dataset

Reason for dataset addition: Converted both PLSC tasks (S2S, P2P) to hierarchical clustering. #702

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

x-tabdeveloping avatar May 14 '24 13:05 x-tabdeveloping

The later levels seem very hard. Maybe we should limit the levels to two?

x-tabdeveloping avatar May 14 '24 13:05 x-tabdeveloping

I'm not sure whether the way I formulated the task makes sense. @rafalposwiata You added the dataset initially, therefore you might know: Is the "disciplines" column hierarchically ordered or just multilabel? Or could "scientific_fields" be used as the first level and "disciplines" as the second? What's your take on this?

x-tabdeveloping avatar May 14 '24 13:05 x-tabdeveloping

> Is the "disciplines" column hierarchically ordered or just multilabel?

Disciplines are multilabel, but for the added clustering tasks I chose only those cases where there is one discipline.

> Or could "scientific_fields" be used as the first level and "disciplines" as the second?

Yes, "scientific_fields" could be used as the first level and "disciplines" as the second.

The entire dataset is available at https://huggingface.co/datasets/rafalposwiata/plsc
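
For illustration, here is a minimal sketch of what that two-level formulation could look like. It assumes each record stores "scientific_fields" and "disciplines" as lists of strings, and it keeps only records with exactly one entry in each. The record layout, the toy values, and the helper name `two_level_labels` are all hypothetical, so the real column types should be checked before relying on this.

```python
from typing import Iterable


def two_level_labels(records: Iterable[dict]) -> list[list[str]]:
    """Return [scientific_field, discipline] for each usable record.

    Assumption: both columns are lists of strings. Records with more
    than one field or more than one discipline are skipped, so every
    kept record has an unambiguous two-level label.
    """
    labels = []
    for rec in records:
        fields = rec["scientific_fields"]
        disciplines = rec["disciplines"]
        if len(fields) == 1 and len(disciplines) == 1:
            labels.append([fields[0], disciplines[0]])
    return labels


# Toy records, invented for illustration only (not taken from the dataset).
records = [
    {"scientific_fields": ["natural sciences"], "disciplines": ["physics"]},
    {"scientific_fields": ["natural sciences"], "disciplines": ["physics", "chemistry"]},
    {"scientific_fields": ["humanities"], "disciplines": ["history"]},
]
labels = two_level_labels(records)
```

With the toy records above, the second record is dropped because it has two disciplines.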

rafalposwiata avatar May 14 '24 19:05 rafalposwiata

@x-tabdeveloping will you add points for this? Then I believe it is ready to merge.

KennethEnevoldsen avatar May 21 '24 09:05 KennethEnevoldsen

I'm not sure, though. The task formulation might be wrong. I think using "scientific_fields" as the first level and "disciplines" as the second might be the way to go. From what I've gathered, this is just multilabel, not hierarchical the way I formulated it, right @rafalposwiata?

x-tabdeveloping avatar May 21 '24 12:05 x-tabdeveloping

@x-tabdeveloping but the current approach is fine with that, right? As I understand it, it just does the clustering at each level?

KennethEnevoldsen avatar May 21 '24 12:05 KennethEnevoldsen

Yes, unless the order is not fixed, and I don't know if it is (we have to check).

x-tabdeveloping avatar May 21 '24 13:05 x-tabdeveloping

Right. Once checked we can either close or merge

KennethEnevoldsen avatar May 21 '24 14:05 KennethEnevoldsen

Nope, it's not hierarchical at all. We can maybe rephrase it as multilabel classification if we really want to, otherwise fine to leave it as flat clustering.
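
A rough sketch of one way such a check could be done, not necessarily how it was done here: treat each example's label list as a path, and require that every label always appears at the same position and always under the same parent. The function name `looks_hierarchical` and the toy label lists are hypothetical.

```python
def looks_hierarchical(label_lists: list[list[str]]) -> bool:
    """Heuristic check that ordered label lists form a fixed hierarchy.

    Returns False as soon as a label shows up at two different levels,
    or under two different parents. Passing the check does not prove
    the labels are a real taxonomy; failing it shows they are not.
    """
    level_of: dict[str, int] = {}
    parent_of: dict[str, str] = {}
    for labels in label_lists:
        for level, label in enumerate(labels):
            # A label in a real hierarchy sits at one fixed depth.
            if level_of.setdefault(label, level) != level:
                return False
            # Below the top level, each label has exactly one parent.
            if level > 0:
                parent = labels[level - 1]
                if parent_of.setdefault(label, parent) != parent:
                    return False
    return True


# Toy examples, invented for illustration only.
tree_like = [["a", "b"], ["a", "c"], ["d", "e"]]
order_varies = [["physics", "chemistry"], ["chemistry", "physics"]]
two_parents = [["x", "y"], ["z", "y"]]
```

Multilabel columns typically fail this check because the same label appears in different positions across examples.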

x-tabdeveloping avatar May 21 '24 14:05 x-tabdeveloping

Let us leave it as flat clustering

KennethEnevoldsen avatar May 27 '24 14:05 KennethEnevoldsen