
Add OpenSubtitles Bitext Mining dataset

loicmagne opened this issue 10 months ago • 15 comments

Hey, I created a bitext mining task from the OpenSubtitles dataset, which is derived from translations of movie subtitles. I think it could be a great addition since it contains "spoken" sentences, which aren't really covered in the benchmark yet, and it includes a lot of under-represented languages.

However, the dataset is pretty large: it contains 1759 language pairs, and for each pair I sampled 1000 sentences randomly. The resulting dataset isn't huge on disk (~300 MB), but loading 1759 splits sequentially with load_dataset takes some time.
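
For context, the loading pattern is essentially one load_dataset call per language pair. A minimal sketch of that pattern, where the repo id, config names, and split name are placeholders:

```python
# Sketch of the per-pair loading pattern; the repo id and split name are assumptions.
from datasets import get_dataset_config_names, load_dataset

repo = "loicmagne/open-subtitles-bitext-mining"  # hypothetical repo id
configs = get_dataset_config_names(repo)  # one config per language pair, e.g. "en-fr"

data = {}
for config in configs:
    # Each call re-resolves the repo and opens its own files, so even a small
    # fixed per-call overhead adds up across ~1759 language pairs.
    data[config] = load_dataset(repo, config, split="train")
```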

Let me know what you think @KennethEnevoldsen @Muennighoff. I'm aware that there are already concerns about the size of some tasks in the benchmark; should I try to reduce the size?

Also, I used the language codes from the original dataset, but I can change them if needed.

Checklist for adding MMTEB dataset

  • [x] I have tested that the dataset runs with the mteb package.
  • [x] I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (see the Python-API sketch after this checklist).
    • [x] sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • [x] intfloat/multilingual-e5-small
  • [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [x] I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.
  • [ ] I have added points for my submission to the POINTS.md file.
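
For anyone who prefers the Python API over the mteb run CLI, here is a minimal sketch of running both checklist models; the task name OpenSubtitlesBitextMining is a placeholder for whatever the new task class ends up being called:

```python
# Minimal sketch using the mteb Python API; "OpenSubtitlesBitextMining" is a
# placeholder for the new task's class name.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

for model_name in (
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
):
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["OpenSubtitlesBitextMining"])
    evaluation.run(model, output_folder=f"results/{model_name}")
```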

loicmagne avatar Apr 09 '24 11:04 loicmagne

Ok I'll run both models and see how I can reduce the size

loicmagne avatar Apr 09 '24 14:04 loicmagne

I ran both models on the original dataset. Based on the results, I kept the 250 language pairs with the lowest scores, excluding a few very low scores that were due to poor data pairing and parsing in the original subtitles.

The dataset now has 250k sentences in total, but they are pretty short (28 characters on average), so I think it's better to keep 1000 sentences per language pair.

[Plot: per-language-pair performance of both models]
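
For transparency, the selection was roughly along these lines; the scores and the manual exclusion list below are illustrative placeholders, not the actual values:

```python
# Illustrative selection of the 250 hardest language pairs from per-pair scores.
# The scores dict and the exclusion set are placeholders for the real results.
scores = {"en-fr": 0.95, "en-de": 0.93, "br-eo": 0.12}  # pair -> score of the stronger model

# Pairs whose low scores came from bad pairing/parsing rather than genuine
# difficulty were excluded by hand.
excluded = {"br-eo"}

candidates = {pair: s for pair, s in scores.items() if pair not in excluded}
kept = sorted(candidates, key=candidates.get)[:250]  # 250 lowest-scoring pairs
```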

loicmagne avatar Apr 10 '24 15:04 loicmagne

Wonderful @loicmagne, this looks good. Is this plot using all the available samples?

In general, I would be more interested in wider coverage than in many examples, so 250 sentences for 1000 languages is better than 1000 sentences for 250 languages.

One way to see whether this influences performance too much is to make the same plot using 250 and 1000 examples per language. We are naturally interested in keeping it as small as possible.

KennethEnevoldsen avatar Apr 10 '24 17:04 KennethEnevoldsen

I see. The reason I tried to reduce the number of language pairs is that there is a per-call overhead in the load_dataset method which makes loading many subsets very slow: just loading the original ~1700 language pairs would take 15+ minutes, even with 100 sentences per language pair.

But I guess in the end the bottleneck for large models will be running inference on the sentences, so it makes sense to reduce the number of sentences. I'll check how significant the performance changes are when reducing the number of samples.

loicmagne avatar Apr 10 '24 17:04 loicmagne

Hmm, that seems odd. Is there an issue on datasets for this? The best I can find is https://github.com/huggingface/datasets/issues/5499, which seems to suggest it is ~4s for loading a cached dataset (which still leaves us at a high number, but better). How is the dataset stored in the cache? (Hopefully as .arrow and not .jsonl.)

KennethEnevoldsen avatar Apr 10 '24 18:04 KennethEnevoldsen

> Hmm, that seems odd. Is there an issue on datasets for this? The best I can find is huggingface/datasets#5499, which seems to suggest it is ~4s for loading a cached dataset (which still leaves us at a high number, but better). How is the dataset stored in the cache? (Hopefully as .arrow and not .jsonl.)

I just checked on a different computer with this dataset: https://huggingface.co/datasets/loicmagne/open-subtitles-250-bitext-mining (250 language pairs, 1000 sentences each). It takes 15 min to load all the subsets from the HF hub and 12 min to load them from the cache, so yeah, there's a crazy overhead. I guess I'll open an issue on the datasets repo.

loicmagne avatar Apr 10 '24 19:04 loicmagne

If you examine the ~/.cache/huggingface folder, can you see how it is stored?
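
For example, something along these lines will show which file formats dominate the cache (assuming the default cache location):

```python
# Count file extensions under the default datasets cache; adjust the path if
# HF_HOME or HF_DATASETS_CACHE is set.
from collections import Counter
from pathlib import Path

cache = Path.home() / ".cache" / "huggingface" / "datasets"
suffixes = Counter(p.suffix for p in cache.rglob("*") if p.is_file())
print(suffixes.most_common())  # ideally dominated by ".arrow", not ".jsonl"
```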

KennethEnevoldsen avatar Apr 11 '24 07:04 KennethEnevoldsen

related to #342

KennethEnevoldsen avatar Apr 11 '24 09:04 KennethEnevoldsen

> If you examine the ~/.cache/huggingface folder, can you see how it is stored?

Yeah it's stored as .arrow in the cache

loicmagne avatar Apr 11 '24 09:04 loicmagne

Hmm, that is odd. Definitely file an issue on datasets to see if there is a reason for this.

KennethEnevoldsen avatar Apr 11 '24 09:04 KennethEnevoldsen

Ok, I got the results for 1000 sentences per language pair vs. 256 sentences. The performances are strongly correlated, even though the task is a bit easier with 256 sentences since there are fewer sentences to match against. I think 256 sentences is fine, and the dataset doesn't look trivial for most language pairs either.

[Plot: per-language-pair scores with 256 vs 1000 sentences]
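
For reference, the correlation can be quantified with a simple per-pair comparison; the score dictionaries below are placeholders for the actual evaluation output:

```python
# Illustrative per-pair comparison of scores at 1000 vs 256 sentences; the
# dictionaries are placeholders, not the real results.
from statistics import correlation  # Python 3.10+

scores_1000 = {"en-fr": 0.95, "br-eo": 0.40, "kab-ru": 0.25}
scores_256 = {"en-fr": 0.97, "br-eo": 0.46, "kab-ru": 0.31}

pairs = sorted(scores_1000)
r = correlation([scores_1000[p] for p in pairs], [scores_256[p] for p in pairs])
print(f"Pearson r across {len(pairs)} language pairs: {r:.2f}")
```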

Still, the loading time is currently very high even when reducing the number of language pairs. I opened an issue on the datasets repo: https://github.com/huggingface/datasets/issues/6800. I suggest we wait for responses before going further with this dataset.

loicmagne avatar Apr 12 '24 07:04 loicmagne

Thanks for working on this @loicmagne. I think it is a good idea to wait as well, as it might influence how we cut the data.

KennethEnevoldsen avatar Apr 12 '24 09:04 KennethEnevoldsen

@loicmagne, should we do something about this PR? (either fewer languages or the solution you suggested in the PR)

related to: https://github.com/huggingface/datasets/issues/6800

KennethEnevoldsen avatar Apr 23 '24 13:04 KennethEnevoldsen

> @loicmagne, should we do something about this PR? (either fewer languages or the solution you suggested in the PR)
>
> related to: huggingface/datasets#6800

Yeah I was waiting to see if there was a better solution than the one I proposed

I would find it frustrating to reduce the number of language pairs, as this discards a lot of data. I'm still running experiments to find a good compromise
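
For illustration, one generic way around the per-config overhead (not necessarily the solution proposed in the PR) is to publish all pairs as a single flat subset with a language-pair column, load it once, and regroup in memory; the repo id and column names here are assumptions:

```python
# Sketch of a single-load workaround; repo id, split, and column names are
# assumptions, and this is not necessarily the approach taken in the PR.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("loicmagne/open-subtitles-flat", split="test")  # hypothetical flat repo

by_pair = defaultdict(lambda: {"sentence1": [], "sentence2": []})
for row in ds:
    by_pair[row["lang_pair"]]["sentence1"].append(row["sentence1"])
    by_pair[row["lang_pair"]]["sentence2"].append(row["sentence2"])
# One download + one Arrow load instead of ~250 separate load_dataset calls.
```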

loicmagne avatar Apr 23 '24 14:04 loicmagne

Yeah, me too. Glad to hear that you are working on it. This would be a great speed-up at almost no cost.

KennethEnevoldsen avatar Apr 23 '24 14:04 KennethEnevoldsen

Closing for now; I'm not sure this dataset would be very useful.

loicmagne avatar May 15 '24 13:05 loicmagne