
High overhead when loading lots of subsets from the same dataset

Open · loicmagne opened this issue 10 months ago · 6 comments

Describe the bug

I have a multilingual dataset that contains a lot of subsets. Each subset corresponds to a pair of languages; you can see an example with 250 subsets here: https://hf.co/datasets/loicmagne/open-subtitles-250-bitext-mining. As part of the MTEB benchmark, we may need to load all the subsets of the dataset. The dataset is relatively small and contains only ~45MB of data, but when I try to load every subset, it takes 15 minutes from the HF Hub and 13 minutes from the cache.

The issue https://github.com/huggingface/datasets/issues/5499 also mentions this overhead, but I'm wondering if there is anything I can do to speed up loading different subsets of the same dataset, both when loading from disk and from the HF Hub. Currently each subset is stored in a separate JSONL file.

Steps to reproduce the bug

from datasets import load_dataset

for subset in ['ka-ml', 'br-sr', 'bg-br', 'kk-lv', 'br-sk', 'br-fi', 'eu-ze_zh', 'kk-nl', 'kk-vi', 'ja-kk', 'br-sv', 'kk-zh_cn', 'kk-ms', 'br-et', 'br-hu', 'eo-kk', 'br-tr', 'ko-tl', 'te-zh_tw', 'br-hr', 'br-nl', 'ka-si', 'br-cs', 'br-is', 'br-ro', 'br-de', 'et-kk', 'fr-hy', 'br-no', 'is-ko', 'br-da', 'br-en', 'eo-lt', 'is-ze_zh', 'eu-ko', 'br-it', 'br-id', 'eu-zh_cn', 'is-ja', 'br-sl', 'br-gl', 'br-pt_br', 'br-es', 'br-pt', 'is-th', 'fa-is', 'br-ca', 'eu-ka', 'is-zh_cn', 'eu-ur', 'id-kk', 'br-sq', 'eu-ja', 'uk-ur', 'is-zh_tw', 'ka-ko', 'eu-zh_tw', 'eu-th', 'eu-is', 'is-tl', 'br-eo', 'eo-ze_zh', 'eu-te', 'ar-kk', 'eo-lv', 'ko-ze_zh', 'ml-ze_zh', 'is-lt', 'br-fr', 'ko-te', 'kk-sl', 'eu-fa', 'eo-ko', 'ka-ze_en', 'eo-eu', 'ta-zh_tw', 'eu-lv', 'ko-lv', 'lt-tl', 'eu-si', 'hy-ru', 'ar-is', 'eu-lt', 'eu-tl', 'eu-uk', 'ka-ze_zh', 'si-ze_zh', 'el-is', 'bn-is', 'ko-ze_en', 'eo-si', 'cs-kk', 'is-uk', 'eu-ze_en', 'ta-ze_zh', 'is-pl', 'is-mk', 'eu-ta', 'ko-lt', 'is-lv', 'fa-ko', 'bn-ko', 'hi-is', 'bn-ze_zh', 'bn-eu', 'bn-ja', 'is-ml', 'eu-ru', 'ko-ta', 'is-vi', 'ja-tl', 'eu-mk', 'eu-he', 'ka-zh_tw', 'ka-zh_cn', 'si-tl', 'is-kk', 'eu-fi', 'fi-ko', 'is-ur', 'ka-th', 'ko-ur', 'eo-ja', 'he-is', 'is-tr', 'ka-ur', 'et-ko', 'eu-vi', 'is-sk', 'gl-is', 'fr-is', 'is-sq', 'hu-is', 'fr-kk', 'eu-sq', 'is-ru', 'ja-ka', 'fi-tl', 'ka-lv', 'fi-is', 'is-si', 'ar-ko', 'ko-sl', 'ar-eu', 'ko-si', 'bg-is', 'eu-hu', 'ko-sv', 'bn-hu', 'kk-ro', 'eu-hi', 'ka-ms', 'ko-th', 'ko-sr', 'ko-mk', 'fi-kk', 'ka-vi', 'eu-ml', 'ko-ml', 'de-ko', 'fa-ze_zh', 'eu-sk', 'is-sl', 'et-is', 'eo-is', 'is-sr', 'is-ze_en', 'kk-pt_br', 'hr-hy', 'kk-pl', 'ja-ta', 'is-ms', 'hi-ze_en', 'is-ro', 'ko-zh_cn', 'el-eu', 'ka-pl', 'ka-sq', 'eu-sl', 'fa-ka', 'ko-no', 'si-ze_en', 'ko-uk', 'ja-ze_zh', 'hu-ko', 'kk-no', 'eu-pl', 'is-pt_br', 'bn-lv', 'tl-zh_cn', 'is-nl', 'he-ko', 'ko-sq', 'ta-th', 'lt-ta', 'da-ko', 'ca-is', 'is-ta', 'bn-fi', 'ja-ml', 'lv-si', 'eu-sv', 'ja-te', 'bn-ur', 'bn-ca', 'bs-ko', 'bs-is', 'eu-sr', 'ko-vi', 'ko-zh_tw', 'et-tl', 'kk-tr', 'eo-vi', 'is-it', 'ja-ko', 'eo-et', 'id-is', 'bn-et', 'bs-eu', 'bn-lt', 'tl-uk', 'bn-zh_tw', 'da-eu', 'el-ko', 'no-tl', 'ko-sk', 'is-pt', 'hu-kk', 'si-zh_tw', 'si-te', 'ka-ru', 'lt-ml', 'af-ja', 'bg-eu', 'eo-th', 'cs-is', 'pl-ze_zh', 'el-kk', 'kk-sv', 'ka-nl', 'ko-pl', 'bg-ko', 'ka-pt_br', 'et-eu', 'tl-zh_tw', 'ka-pt', 'id-ko', 'fi-ze_zh', 'he-kk', 'ka-tr']:
    load_dataset('loicmagne/open-subtitles-250-bitext-mining', subset)
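
For reference, the subset names don't have to be hardcoded; here is a minimal sketch that enumerates them from the Hub instead (it assumes the configs declared in the dataset card are exactly the language pairs):

from datasets import get_dataset_config_names, load_dataset

# Fetch every subset (config) name declared in the dataset card
subsets = get_dataset_config_names('loicmagne/open-subtitles-250-bitext-mining')

# Loading them one by one still pays the per-subset overhead described above
for subset in subsets:
    load_dataset('loicmagne/open-subtitles-250-bitext-mining', subset)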

Expected behavior

Faster loading?

Environment info


  • datasets version: 2.18.0
  • Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.2
  • fsspec version: 2023.5.0

loicmagne · Apr 10 '24 21:04

Hi!

It's possible to load multiple files at once:

from datasets import load_dataset

# Load every JSONL file at once with a glob pattern
data_files = "data/*.jsonl"
# Or pass an explicit list of files
langs = ['ka-ml', 'br-sr', 'ka-pt', 'id-ko', ..., 'fi-ze_zh', 'he-kk', 'ka-tr']
data_files = [f"data/{lang}.jsonl" for lang in langs]
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files=data_files, split="train")

Also, maybe you can add a subset called "all" for people who want to load all the data without having to list all the languages?

  - config_name: all
    data_files: data/*.jsonl
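
For context, in the dataset card (README.md) YAML header this entry would sit under the top-level configs: key, roughly like this (a sketch; the existing per-pair configs stay listed alongside it):

configs:
- config_name: all
  data_files: data/*.jsonl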

lhoestq · Apr 15 '24 16:04

Thanks for your reply, it is indeed much faster. However, the result is a dataset where all the subsets are "merged" together and the language pair is lost:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2'],
        num_rows: 247809
    })
})

I guess I could add a 'lang' feature to each row in the dataset. Is there a better way to do it?

loicmagne · Apr 15 '24 18:04

Hi @lhoestq, over at https://github.com/embeddings-benchmark/mteb/issues/530 we have started examining these issues and would love to make a PR for datasets if we believe there is a way to improve the speed. Since I assume you have a better overview than me, would you be interested in a PR, and do you have an idea of where we should start working on it?

We see a speed comparison of:

  1. 15 minutes (for ~20% of the languages) when loaded using a for loop
  2. 17 minutes using your suggestion
  3. ~30 seconds when using @loicmagne's "merged" method

It is worth mentioning that solution 2 loses the language information.

KennethEnevoldsen · Apr 24 '24 07:04

Can you retry using datasets 2.19 ? We improved a lot the speed of downloading datasets with tons of small files.

pip install -U datasets

Now this takes 17 seconds on my side instead of the 17 minutes @loicmagne mentioned :)

>>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
Wall time: 17.4 s

lhoestq · Apr 24 '24 13:04

I was actually just noticing that: I bumped from 2.18 to 2.19 and got a massive speedup, amazing!

About the fact that subset names are lost when loading all files at once: currently my solution is to add a 'lang' feature to each row, convert to polars, and use:

ds_split = ds.to_polars().group_by('lang')

It's fast, so I think it's an acceptable solution, but is there a better way to do it?
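
For reference, a minimal end-to-end sketch of that workflow (it assumes each JSONL row already carries a 'lang' column holding its pair name, and that polars is installed):

from datasets import load_dataset

# Load all the JSONL files as a single merged train split
ds = load_dataset(
    "loicmagne/open-subtitles-250-bitext-mining",
    data_files="data/*.jsonl",
    split="train",
)

# Convert to polars and split the merged data back into one DataFrame per pair
per_pair = {}
for key, frame in ds.to_polars().group_by("lang"):
    # group_by yields (key, DataFrame) pairs; recent polars returns the key as a tuple
    lang = key[0] if isinstance(key, tuple) else key
    per_pair[lang] = frame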

loicmagne · Apr 24 '24 13:04

It's the fastest way I think :)

Alternatively, you can download the dataset repository locally using huggingface_hub (either via the CLI or in Python) and load the subsets one by one with a for loop as you were doing before (just pass the directory path to load_dataset instead of the dataset_id).
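
A minimal sketch of that approach with snapshot_download (the two pairs listed here are just placeholders):

from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download the whole dataset repository once; files are cached locally
local_dir = snapshot_download(
    "loicmagne/open-subtitles-250-bitext-mining",
    repo_type="dataset",
)

# Then load the subsets one by one from the local copy
for subset in ["ka-ml", "br-sr"]:  # any pairs you need
    ds = load_dataset(local_dir, subset)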

lhoestq · Apr 24 '24 13:04