Add OpenSubtitles Bitext Mining dataset
Hey, I created a bitext mining task from the OpenSubtitles dataset, which is derived from translations of movie subtitles. I think it could be a great addition: it contains "spoken" sentences, which aren't really covered in the benchmark yet, and it covers a lot of under-represented languages.
However, the dataset is pretty huge: it contains 1759 language pairs, and for each language pair I randomly sampled 1000 sentences. The resulting dataset isn't that big (~300MB), but loading 1759 splits sequentially with load_dataset takes some time.
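For context, the per-pair sampling looks roughly like the sketch below; the source dataset id and config names are placeholders, not necessarily the exact ones used in the PR:

```python
import random

from datasets import load_dataset

SOURCE = "open_subtitles"   # placeholder id for the OpenSubtitles corpus
PAIRS = ["en-fr", "de-it"]  # the full corpus has 1759 language pairs
N_SAMPLES = 1000

for pair in PAIRS:
    ds = load_dataset(SOURCE, pair, split="train")
    # Randomly keep up to N_SAMPLES sentence pairs per language pair.
    idx = random.sample(range(len(ds)), k=min(N_SAMPLES, len(ds)))
    sampled = ds.select(idx)
    # Each `sampled` subset would then be uploaded as one subset of the
    # bitext-mining dataset, e.g. with sampled.push_to_hub(...).
```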
Let me know wdyt @KennethEnevoldsen @Muennighoff. I'm aware that there are already concerns about the size of some tasks in the benchmark; should I try to reduce the size?
Also, I used the language codes from the original dataset, but I can change them if needed.
Checklist for adding MMTEB dataset
- [x] I have tested that the dataset runs with the `mteb` package.
- [x] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command (see the sketch after this list).
  - [x] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [x] `intfloat/multilingual-e5-small`
- [x] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [x] I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks).
- [x] Run tests locally to make sure nothing is broken using `make test`.
- [x] Run the formatter to format the code using `make lint`.
- [ ] I have added points for my submission to the POINTS.md file.
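For reference, the two models above can also be evaluated from Python; a minimal sketch, assuming the task added in this PR is registered under the name `OpenSubtitlesBitextMining` (the exact task name may differ):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Assumed task name for the dataset added in this PR.
TASK_NAME = "OpenSubtitlesBitextMining"

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=[TASK_NAME])
    evaluation.run(model, output_folder=f"results/{model_name}")
```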
Ok I'll run both models and see how I can reduce the size
I ran both models on the original dataset. Based on the results, I kept the 250 language pairs with the lowest scores, excluding some very low scores that were due to poor data pairing and parsing in the original subtitles.
The dataset now has 250k sentences in total, but they are pretty short (28 characters on average), so I think it's better to keep 1000 sentences per language pair.
Wonderful @loicmagne, this looks good. Is this plot using all the available samples?
In general, I would be more interested in wider coverage than in many examples, so 250 sentences per language pair for 1000 languages is better than 1000 sentences for 250 languages.
One way to see whether this influences performance too much is to do the same plot using 250 and 1000 examples per language. We are naturally interested in keeping it as small as possible.
I see. The reason I tried to reduce the number of language pairs is that there is a per-subset overhead in the load_dataset method which makes loading many subsets very slow; just loading the original 1700 language pairs would take 15+ minutes, even with 100 sentences per language pair.
But I guess in the end the bottleneck for large models will be running inference on the sentences, so it makes sense to reduce the number of sentences. I'll check how significant the performance changes are when reducing the number of samples.
Hmm, that seems odd. Is there an issue on datasets for this? The best I can find is https://github.com/huggingface/datasets/issues/5499, which seems to suggest it is ~4s to load a cached dataset (which still leaves us at a high number, but better). How is the dataset stored in the cache? (hopefully as .arrow and not .jsonl)
I just checked on a different computer with this dataset: https://huggingface.co/datasets/loicmagne/open-subtitles-250-bitext-mining (250 language pairs, 1000 sentences each). It takes 15 min to load all the subsets from the HF hub, and 12 min to load them from the cache, so yeah, there's a crazy overhead. I guess I'll open an issue on the datasets repo.
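For anyone wanting to reproduce the numbers above, a minimal sketch along these lines, assuming each language pair is exposed as a separate config (how the subsets are actually laid out in the repo may differ):

```python
import time

from datasets import get_dataset_config_names, load_dataset

repo = "loicmagne/open-subtitles-250-bitext-mining"
configs = get_dataset_config_names(repo)

start = time.perf_counter()
for cfg in configs:
    load_dataset(repo, cfg)  # one load_dataset call per language pair
elapsed = time.perf_counter() - start
print(f"Loaded {len(configs)} subsets in {elapsed:.0f}s")
```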
If you examine the ~/.cache/huggingface folder, can you see how it is stored?
related to #342
Yeah it's stored as .arrow in the cache
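A quick way to check is to count cached files by extension, e.g.:

```python
from collections import Counter
from pathlib import Path

# Default datasets cache location; adjust if HF_HOME or HF_DATASETS_CACHE is set.
cache = Path.home() / ".cache" / "huggingface" / "datasets"
print(Counter(p.suffix for p in cache.rglob("*") if p.is_file()))
```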
Hmm, that is odd. Definitely file an issue on datasets to see if there is a reason for this.
Ok, I got the results for 1000 sentences per language pair vs 256 sentences. The performances are strongly correlated, even though the task is a bit easier with 256 sentences since there are fewer sentences to match against; I think 256 sentences is fine. The dataset doesn't look trivial either for most language pairs.
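A minimal sketch of that comparison, assuming per-language-pair scores from the two runs are available as dicts (the numbers below are illustrative, not the actual results):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-language-pair scores from the 1000- and 256-sentence runs.
scores_1000 = {"fr-en": 0.81, "de-it": 0.67, "ka-sq": 0.32}
scores_256 = {"fr-en": 0.84, "de-it": 0.71, "ka-sq": 0.35}

pairs = sorted(scores_1000)
a = [scores_1000[p] for p in pairs]
b = [scores_256[p] for p in pairs]

print("Pearson: ", pearsonr(a, b)[0])
print("Spearman:", spearmanr(a, b)[0])
```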
Still, the loading time is currently very high even when reducing the number of language pairs. I opened an issue on the datasets repo (https://github.com/huggingface/datasets/issues/6800); I suggest we wait for responses before going further with this dataset.
Thanks for working on this @loicmagne, I think it is a good idea to wait as well, as it might influence how we cut the data.
@loicmagne, should we do something about this PR? (either fewer languages or the solution you suggested in the PR)
related to: https://github.com/huggingface/datasets/issues/6800
Yeah I was waiting to see if there was a better solution than the one I proposed
I would find it frustrating to reduce the number of language pairs, as this discards a lot of data. I'm still running experiments to find a good compromise
Yea, me too. Glad to hear that you are working on it. This would be a great speed-up at almost no cost.
Closing for now, as I'm not sure this dataset would be very useful.