
fix: Convert Multilingual/Crosslingual to fast-loading format

Open loicmagne opened this issue 9 months ago • 12 comments

Following https://github.com/embeddings-benchmark/mteb/issues/530, https://github.com/embeddings-benchmark/mteb/pull/572

The goal of this PR is to convert multilingual/crosslingual datasets to the fast-loading format, i.e. one where each row in the dataset has an additional "lang" feature. I don't know of an automatic way to do this, so for now I'm updating each dataset on a case-by-case basis, which is a bit tedious
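To illustrate the idea, here is a minimal sketch (with hypothetical data and a hypothetical `to_fast_format` helper, not the actual conversion script): instead of one config per language, every row carries an explicit "lang" field, so the whole dataset loads in one pass and a language subset becomes a simple filter.

```python
# Sketch of the "fast loading" conversion on toy data (hypothetical,
# not the actual mteb script). Per-language subsets are flattened into
# a single table where each row carries a "lang" feature.

per_subset = {
    "en": [{"sentence": "hello"}, {"sentence": "world"}],
    "fr": [{"sentence": "bonjour"}],
}

def to_fast_format(subsets):
    """Flatten {lang: rows} into one list of rows with a 'lang' column."""
    flat = []
    for lang, rows in subsets.items():
        for row in rows:
            flat.append({**row, "lang": lang})
    return flat

flat = to_fast_format(per_subset)
# Selecting a language is then a filter instead of a separate config:
fr_rows = [r for r in flat if r["lang"] == "fr"]
```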

List of datasets converted/to convert:

STS:

  • [x] STS17Crosslingual
  • [x] STS22CrosslingualSTS
  • [x] IndicCrosslingualSTS
  • [x] STSBenchmarkMultilingualSTS

Pair classification:

  • [x] XNLI

Bitext Mining:

  • [x] TatoebaBitextMining
  • [x] BUCCBitextMining
  • [x] FloresBitextMining
  • [x] IN22ConvBitextMining
  • [x] IN22GenBitextMining
  • [x] NTREXBitextMining

  • [ ] BibleNLPBitextMining 🚧

Classification:

  • [x] IndicSentimentClassification
  • [x] MultiHateClassification
  • [x] MultilingualSentimentClassification
  • [x] TweetSentimentClassification
  • [x] MasakhaNEWSClassification
  • [x] SIB200Classification
  • [x] MassiveIntentClassification
  • [x] MassiveScenarioClassification

These are the datasets with >10 subsets, which would benefit the most from fast loading.

loicmagne avatar May 05 '24 16:05 loicmagne

One of the issues with converting existing datasets is that several of them use custom loading scripts, which makes converting to the fast format non-trivial

For example, Flores, NTREX and IN22-Conv don't explicitly define subsets for each language pair; they contain data for each language, and the pairs are created on the fly (so basically the 'en-fr' subset and the 'en-es' one share the same 'en' sentences). I'm not sure what the correct way to handle this would be. Converting those datasets to the standard "1 file per subset" format would duplicate a lot of data, but having different configurations for each dataset is hard to maintain
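The on-the-fly pairing can be sketched like this (hypothetical toy data, not the actual loading scripts): each language is stored once as an aligned sentence list, and a pair subset is just a zip of two of those lists, so 'en-fr' and 'en-es' reuse the same 'en' sentences without duplication.

```python
# Sketch of the Flores/NTREX-style layout described above, on toy data.
# One aligned sentence list per language; language-pair subsets are
# built on the fly rather than stored as separate files.
from itertools import permutations

sentences = {
    "en": ["hello", "world"],
    "fr": ["bonjour", "monde"],
    "es": ["hola", "mundo"],
}

def make_pair(src, tgt):
    """Zip two aligned sentence lists into (source, target) pairs."""
    return list(zip(sentences[src], sentences[tgt]))

# All ordered pairs, e.g. 'en-fr' and 'en-es' share the same 'en' list.
pairs = {f"{a}-{b}": make_pair(a, b) for a, b in permutations(sentences, 2)}
```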

loicmagne avatar May 06 '24 14:05 loicmagne

@loicmagne, let us do the easy ones in this PR and then discuss how to best handle the rest.

KennethEnevoldsen avatar May 07 '24 08:05 KennethEnevoldsen

I've managed to convert most multilingual datasets from the STS, Pair classification and Classification categories, at least those with more than 10 subsets

The remaining datasets are in the Bitext Mining category and are not straightforward to convert: they either have a different format or have too many files to be loaded in one go (see this issue https://github.com/huggingface/datasets/issues/6877 )

I suggest we merge this PR and then discuss how to handle the remaining datasets, @KennethEnevoldsen?

loicmagne avatar May 07 '24 22:05 loicmagne

I checked that all the results remain the same within a 1e-4 threshold, although I don't really know why they sometimes vary slightly
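For reference, a check like that can be done with a small recursive comparison of the nested score dicts (a hypothetical helper with made-up scores, not the script actually used here):

```python
# Sketch: recursively compare two nested score dicts within a tolerance.
# Hypothetical helper and toy scores, not the actual verification script.
import math

def scores_close(old, new, tol=1e-4):
    """Return True if all leaf scores agree within an absolute tolerance."""
    if isinstance(old, dict):
        return old.keys() == new.keys() and all(
            scores_close(old[k], new[k], tol) for k in old
        )
    return math.isclose(old, new, abs_tol=tol)

old = {"STS17": {"en-en": 0.8123, "en-fr": 0.7011}}
new = {"STS17": {"en-en": 0.81235, "en-fr": 0.7011}}
assert scores_close(old, new)
```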

loicmagne avatar May 07 '24 22:05 loicmagne

I don't really know why they sometimes vary slightly

My guess is that there is a single place where the calculations change per run, influencing the seed (a solution is to use an rng_state which is passed along but not influenced by other operations, as is e.g. done in #481). For a related blog post you might want to check out: https://builtin.com/data-science/numpy-random-seed
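The idea, sketched with the standard library (mteb itself passes an rng_state in a similar way, see #481): give the sampling code its own RNG instance instead of relying on the global seed, so unrelated random calls elsewhere in the run can't shift the results.

```python
# Sketch of the rng_state pattern using a dedicated random.Random
# instance (illustrative; not mteb's actual implementation).
import random

def sample_items(items, k, rng):
    """Sample with a dedicated RNG so global random() calls can't interfere."""
    return rng.sample(items, k)

# Two runs with the same dedicated seed give identical samples, even
# though an unrelated global draw happens in between.
first = sample_items(range(100), 3, random.Random(42))
random.random()  # unrelated global draw between the two runs
second = sample_items(range(100), 3, random.Random(42))
assert first == second
```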

I think that is a separate PR, though. I think merging this is a great idea.

Will you add points? (Bug fixes where you add one point per dataset seems reasonable.) Will you also create an issue for the remaining datasets not addressed here?

KennethEnevoldsen avatar May 08 '24 11:05 KennethEnevoldsen

Sounds good, I'm writing the issue for the remaining datasets

loicmagne avatar May 08 '24 12:05 loicmagne

Issue opened here for the remaining datasets: https://github.com/embeddings-benchmark/mteb/issues/651

loicmagne avatar May 08 '24 14:05 loicmagne

@KennethEnevoldsen Following https://github.com/embeddings-benchmark/mteb/issues/651 I converted the remaining datasets to a compact format and changed the BitextMiningEvaluator accordingly to handle multiple languages

I think this makes those datasets usable now: loading and running evaluation on Flores across all 42k language pairs now takes <10 minutes with small models

The last remaining dataset, BibleNLPBitextMining, relies on a fix in the datasets library ( https://github.com/huggingface/datasets/pull/6893 ) which isn't in the latest release yet, so I'll wait for that

loicmagne avatar May 14 '24 22:05 loicmagne

@loicmagne once we have resolved the question related to BUCC I believe we can merge this

KennethEnevoldsen avatar May 15 '24 09:05 KennethEnevoldsen

@loicmagne once we have resolved the question related to BUCC I believe we can merge this

@KennethEnevoldsen For the BUCC dataset, you can see the new results in the BUCC.json file. Overall there's a ~1 percentage point difference; I can revert the changes if that's a problem, but it simplifies the code greatly

How did you proceed for other tasks (clustering, I think) where the changes produced non-backward-compatible results?

loicmagne avatar May 15 '24 09:05 loicmagne

@loicmagne what we have done is keep the old implementation (which you might do here by moving the logic over to the BUCC task) and then add a superseeded_by = "new_task_name" (see e.g. #694) to the old task. Then we can still run the old task, but it will raise a warning stating that X supersedes it.
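A minimal sketch of that mechanism (hypothetical class names and a simplified task base, not mteb's actual implementation; the attribute is spelled as in the comment above): the old task stays runnable but warns that a newer version supersedes it.

```python
# Sketch of the supersession-warning pattern (simplified, hypothetical
# classes; not mteb's actual task hierarchy).
import warnings

class AbsTask:
    superseeded_by = None  # name of the replacement task, if any

    def run(self):
        if self.superseeded_by:
            warnings.warn(
                f"{type(self).__name__} is superseded by "
                f"{self.superseeded_by}; consider running that instead."
            )
        return "scores"

class BUCC(AbsTask):
    superseeded_by = "BUCC.v2"  # old task: still runs, but warns

class BUCCv2(AbsTask):
    pass  # new task: runs silently
```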

I believe this is the best approach here as well.

KennethEnevoldsen avatar May 15 '24 11:05 KennethEnevoldsen

@KennethEnevoldsen alright, I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation-time difference between the two, so I think it's worth having the new version

I think we can merge then

loicmagne avatar May 15 '24 13:05 loicmagne

@KennethEnevoldsen Can we merge this?

loicmagne avatar May 17 '24 10:05 loicmagne

Yes indeed - thanks again @loicmagne! I have enabled automerge

@KennethEnevoldsen alright I added back the previous BUCC version and named the new one BUCC.v2. There's almost a 20x evaluation time difference between the two so I think it's worth having the new version

Sorry, I was out yesterday so didn't have time to look at this before now. A 20x speedup def. seems like a good reason to introduce v2

KennethEnevoldsen avatar May 17 '24 10:05 KennethEnevoldsen