
New fast loader for Bitext Mining parallel corpora

loicmagne opened this issue 9 months ago · 4 comments

Following https://github.com/embeddings-benchmark/mteb/issues/530 and https://github.com/embeddings-benchmark/mteb/pull/635, 14 datasets have been converted to a format that loads quickly, where it used to take minutes.

There are still 6 datasets that are very slow to load:

  • BibleNLPBitextMining
  • BUCCBitextMining
  • FloresBitextMining
  • IN22ConvBitextMining
  • IN22GenBitextMining
  • NTREXBitextMining

Those are all Bitext Mining tasks. They have a particular format that makes them non-trivial to adapt to the fast format: instead of one file per subset (i.e. per language pair), there is one file per language, and the pairs are created on the fly. For example, the Flores dataset has ~1000 sentences per language; to get the lang1-lang2 subset, you simply take the sentences of both languages, and the sentence matching is defined by line order: the first sentence of lang1 is the translation of the first sentence of lang2, and so on.
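A minimal sketch of that line-order pairing (the function name and example sentences are hypothetical, not from the mteb codebase):

```python
def pair_sentences(lang1_lines, lang2_lines):
    """Build (sentence1, sentence2) pairs from two line-aligned corpora.

    Line i of lang1 is assumed to be the translation of line i of lang2.
    """
    assert len(lang1_lines) == len(lang2_lines), "aligned corpora must have equal length"
    return list(zip(lang1_lines, lang2_lines))

eng = ["Hello.", "How are you?"]
fra = ["Bonjour.", "Comment allez-vous ?"]
pairs = pair_sentences(eng, fra)
# → [("Hello.", "Bonjour."), ("How are you?", "Comment allez-vous ?")]
```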

We could pre-generate every pair, but that would duplicate tons of data (Flores has 204 languages, i.e. ~40k pairs), and additionally there's currently an issue in the datasets library when you load too many files (see https://github.com/huggingface/datasets/issues/6877).

My solution would be to create a data loader for "Parallel Corpus" datasets that downloads the individual language data and does the pairing as a data-processing step. The speedup would come from loading each language's data in one go instead of iteratively.
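A rough sketch of what such a loader could look like, assuming some `load_language` callable that fetches one language's sentences (the class and callable are hypothetical, not the actual mteb implementation):

```python
class ParallelCorpusLoader:
    """Sketch: fetch each language's sentences once, build pairs on demand."""

    def __init__(self, load_language):
        # load_language: callable lang_code -> list[str], e.g. a datasets download
        self._load_language = load_language
        self._languages = {}

    def _get(self, lang):
        # Download each language only once, no matter how many pairs need it
        if lang not in self._languages:
            self._languages[lang] = self._load_language(lang)
        return self._languages[lang]

    def subset(self, lang1, lang2):
        # Pairing is just a processing step over the already-loaded languages
        return {"sentence1": self._get(lang1), "sentence2": self._get(lang2)}
```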

By the way, this makes me realize that the way those datasets are evaluated is probably inefficient: to evaluate the pairs en-fr and en-es, the "en" sentences are embedded twice even though they're the same sentences. I don't know how costly it would be to cache the embeddings, but it might be something to think about.

Let me know what you think, and if you have any suggestions

loicmagne avatar May 08 '24 14:05 loicmagne

@loicmagne it might be possible to derive a more reasonable format where we allow for more than two languages (e.g. en, es, fr), then embed all of them at once and compare all pair combinations. This would avoid having a cache and avoid embedding documents twice.

You would have to supply some sort of column annotation, e.g. {"en": "eng-Latn", ...}, and then rewrite the evaluation script to allow for multiple columns. If no column annotations are specified, it would simply assume that the first column is lang1 and the second is lang2 (which is the current behavior).
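The "embed each column once, compare all combinations" idea could be sketched like this (the function names, the table-as-dict format, and the scoring callable are all assumptions for illustration, not the real evaluator API):

```python
import itertools

def evaluate_all_pairs(table, embed, score_pair):
    """Embed every language column once, then score every pair combination.

    table:      dict mapping column name (e.g. "en") -> list of sentences
    embed:      callable list[str] -> embeddings
    score_pair: callable (embeddings, embeddings) -> score
    """
    # Each language is embedded exactly once, so "en" is never re-embedded
    # for en-fr, en-es, en-de, ...
    embeddings = {col: embed(sents) for col, sents in table.items()}
    return {
        (a, b): score_pair(embeddings[a], embeddings[b])
        for a, b in itertools.combinations(table, 2)
    }
```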

KennethEnevoldsen avatar May 08 '24 19:05 KennethEnevoldsen

By the way, this makes me realize that the way those datasets are evaluated is probably inefficient: to evaluate the pairs en-fr and en-es, the "en" sentences will be embedded twice while they're the same sentence.

Adding to this, the current way of evaluating Flores means each language's sentences are embedded 204 times in total.

There are 2 things I noticed that I would love some feedback on.

1) Lang pairs

One observation is that BibleNLPBitextMining only has eng-"lang" pairs, where lang is one of the 828 languages, which results in 1656 pairs. If we did the same for Flores, we'd only have 408 pairs, but I'm not sure we want to exclude the rest like this. Conversely, we could extend the lang pairs of the Bible dataset to include all 640K pairs.
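The pair counts above come out of simple arithmetic (counting ordered pairs, i.e. both translation directions):

```python
bible_langs = 828
flores_langs = 204

bible_eng_pairs = 2 * bible_langs   # eng->lang and lang->eng: 1656 pairs
flores_eng_pairs = 2 * flores_langs # 408 pairs if Flores were restricted to eng
flores_all_pairs = flores_langs * (flores_langs - 1)  # 41412 ordered pairs, the "~40k" above
```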

2) Load all and embed at once

If we load all langs, say for Flores, then embed them (say with intfloat/multilingual-e5-small) while keeping them in memory, we need 204 langs × 1000 sentences × 384 dims × 4 bytes (float32) ≈ 300MB. That's not too bad. For the Bible dataset it would be ~4x that, so just above 1GB, which most machines should be able to handle. To work around the "too many files" issue, we might need to use streaming until there's a fix in datasets.
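Spelling out the back-of-envelope memory estimate (assuming ~1000 sentences per language for the Bible dataset as well, which is an assumption, not a figure from the dataset card):

```python
langs, sents_per_lang, dims, bytes_per_float32 = 204, 1000, 384, 4

# All Flores embeddings held in RAM at once
flores_mb = langs * sents_per_lang * dims * bytes_per_float32 / 1e6  # ≈ 313 MB

# Bible dataset: 828 languages, same per-language size assumed
bible_mb = flores_mb * 828 / 204  # ≈ 1272 MB, "just above 1GB"
```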

From this I'd say that a cache is likely not needed, and we could encode all sentences at once.

isaac-chung avatar May 09 '24 08:05 isaac-chung

Thanks for the feedback @isaac-chung @KennethEnevoldsen. I think it makes a lot of sense to evaluate multiple languages at once, and it wouldn't be much work to change the evaluator to do so.

I don't know about restricting to eng-"lang" pairs. I agree evaluating all n² pairs feels redundant, but we'd need to run tests to see whether performance on the A-B and A-C pairs correlates with performance on B-C.

loicmagne avatar May 09 '24 21:05 loicmagne

I don't know about restricting to eng-"lang" pairs. I agree evaluating all n² pairs feels redundant, but we'd need to run tests to see whether performance on the A-B and A-C pairs correlates with performance on B-C.

One solution to this (at least to keep the current behavior) is to optionally specify a list of pairs.
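That optional pair list could be sketched as a small selection helper (a hypothetical function, not the mteb API): an explicit list keeps today's behavior, while `None` falls back to all combinations.

```python
import itertools

def select_pairs(languages, pairs=None):
    """Return the language pairs to evaluate.

    pairs=None  -> every unordered combination of the given languages
    pairs=[...] -> only the explicitly requested pairs (current behavior)
    """
    if pairs is not None:
        return [(a, b) for a, b in pairs if a in languages and b in languages]
    return list(itertools.combinations(languages, 2))
```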

KennethEnevoldsen avatar May 09 '24 21:05 KennethEnevoldsen

Implemented in https://github.com/embeddings-benchmark/mteb/pull/635

loicmagne avatar May 17 '24 13:05 loicmagne