
High overhead when loading lots of subsets from the same dataset

Open · loicmagne opened this issue 10 months ago · 6 comments

Describe the bug

I have a multilingual dataset that contains a lot of subsets. Each subset corresponds to a pair of languages; you can see an example with 250 subsets here: https://hf.co/datasets/loicmagne/open-subtitles-250-bitext-mining. As part of the MTEB benchmark, we may need to load all the subsets of the dataset. The dataset is relatively small and contains only ~45MB of data, but when I try to load every subset, it takes 15 minutes from the HF Hub and 13 minutes from the cache.

The issue https://github.com/huggingface/datasets/issues/5499 also mentions this overhead, but I'm wondering if there is anything I can do to speed up loading different subsets of the same dataset, both when loading from disk and from the HF Hub. Currently each subset is stored in a separate JSONL file.

Steps to reproduce the bug

from datasets import load_dataset

for subset in ['ka-ml', 'br-sr', 'bg-br', 'kk-lv', 'br-sk', 'br-fi', 'eu-ze_zh', 'kk-nl', 'kk-vi', 'ja-kk', 'br-sv', 'kk-zh_cn', 'kk-ms', 'br-et', 'br-hu', 'eo-kk', 'br-tr', 'ko-tl', 'te-zh_tw', 'br-hr', 'br-nl', 'ka-si', 'br-cs', 'br-is', 'br-ro', 'br-de', 'et-kk', 'fr-hy', 'br-no', 'is-ko', 'br-da', 'br-en', 'eo-lt', 'is-ze_zh', 'eu-ko', 'br-it', 'br-id', 'eu-zh_cn', 'is-ja', 'br-sl', 'br-gl', 'br-pt_br', 'br-es', 'br-pt', 'is-th', 'fa-is', 'br-ca', 'eu-ka', 'is-zh_cn', 'eu-ur', 'id-kk', 'br-sq', 'eu-ja', 'uk-ur', 'is-zh_tw', 'ka-ko', 'eu-zh_tw', 'eu-th', 'eu-is', 'is-tl', 'br-eo', 'eo-ze_zh', 'eu-te', 'ar-kk', 'eo-lv', 'ko-ze_zh', 'ml-ze_zh', 'is-lt', 'br-fr', 'ko-te', 'kk-sl', 'eu-fa', 'eo-ko', 'ka-ze_en', 'eo-eu', 'ta-zh_tw', 'eu-lv', 'ko-lv', 'lt-tl', 'eu-si', 'hy-ru', 'ar-is', 'eu-lt', 'eu-tl', 'eu-uk', 'ka-ze_zh', 'si-ze_zh', 'el-is', 'bn-is', 'ko-ze_en', 'eo-si', 'cs-kk', 'is-uk', 'eu-ze_en', 'ta-ze_zh', 'is-pl', 'is-mk', 'eu-ta', 'ko-lt', 'is-lv', 'fa-ko', 'bn-ko', 'hi-is', 'bn-ze_zh', 'bn-eu', 'bn-ja', 'is-ml', 'eu-ru', 'ko-ta', 'is-vi', 'ja-tl', 'eu-mk', 'eu-he', 'ka-zh_tw', 'ka-zh_cn', 'si-tl', 'is-kk', 'eu-fi', 'fi-ko', 'is-ur', 'ka-th', 'ko-ur', 'eo-ja', 'he-is', 'is-tr', 'ka-ur', 'et-ko', 'eu-vi', 'is-sk', 'gl-is', 'fr-is', 'is-sq', 'hu-is', 'fr-kk', 'eu-sq', 'is-ru', 'ja-ka', 'fi-tl', 'ka-lv', 'fi-is', 'is-si', 'ar-ko', 'ko-sl', 'ar-eu', 'ko-si', 'bg-is', 'eu-hu', 'ko-sv', 'bn-hu', 'kk-ro', 'eu-hi', 'ka-ms', 'ko-th', 'ko-sr', 'ko-mk', 'fi-kk', 'ka-vi', 'eu-ml', 'ko-ml', 'de-ko', 'fa-ze_zh', 'eu-sk', 'is-sl', 'et-is', 'eo-is', 'is-sr', 'is-ze_en', 'kk-pt_br', 'hr-hy', 'kk-pl', 'ja-ta', 'is-ms', 'hi-ze_en', 'is-ro', 'ko-zh_cn', 'el-eu', 'ka-pl', 'ka-sq', 'eu-sl', 'fa-ka', 'ko-no', 'si-ze_en', 'ko-uk', 'ja-ze_zh', 'hu-ko', 'kk-no', 'eu-pl', 'is-pt_br', 'bn-lv', 'tl-zh_cn', 'is-nl', 'he-ko', 'ko-sq', 'ta-th', 'lt-ta', 'da-ko', 'ca-is', 'is-ta', 'bn-fi', 'ja-ml', 'lv-si', 'eu-sv', 'ja-te', 'bn-ur', 'bn-ca', 'bs-ko', 'bs-is', 'eu-sr', 'ko-vi', 'ko-zh_tw', 'et-tl', 'kk-tr', 'eo-vi', 'is-it', 'ja-ko', 'eo-et', 'id-is', 'bn-et', 'bs-eu', 'bn-lt', 'tl-uk', 'bn-zh_tw', 'da-eu', 'el-ko', 'no-tl', 'ko-sk', 'is-pt', 'hu-kk', 'si-zh_tw', 'si-te', 'ka-ru', 'lt-ml', 'af-ja', 'bg-eu', 'eo-th', 'cs-is', 'pl-ze_zh', 'el-kk', 'kk-sv', 'ka-nl', 'ko-pl', 'bg-ko', 'ka-pt_br', 'et-eu', 'tl-zh_tw', 'ka-pt', 'id-ko', 'fi-ze_zh', 'he-kk', 'ka-tr']:
    load_dataset('loicmagne/open-subtitles-250-bitext-mining', subset)
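
For reference, the subset names don't have to be hardcoded; here is a minimal sketch that enumerates them from the Hub instead (it assumes the configs declared in the dataset card are exactly the language pairs):

from datasets import get_dataset_config_names, load_dataset

# Fetch every subset (config) name declared in the dataset card
subsets = get_dataset_config_names('loicmagne/open-subtitles-250-bitext-mining')

# Loading them one by one still pays the per-subset overhead described above
for subset in subsets:
    load_dataset('loicmagne/open-subtitles-250-bitext-mining', subset)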

Expected behavior

Faster loading?

Environment info


  • datasets version: 2.18.0
  • Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.2
  • fsspec version: 2023.5.0

loicmagne · Apr 10 '24 21:04

Hi!

It's possible to load multiple files at once:

from datasets import load_dataset

# Load every JSONL file at once with a glob pattern
data_files = "data/*.jsonl"
# Or pass an explicit list of files
langs = ['ka-ml', 'br-sr', 'ka-pt', 'id-ko', ..., 'fi-ze_zh', 'he-kk', 'ka-tr']
data_files = [f"data/{lang}.jsonl" for lang in langs]
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files=data_files, split="train")

Also, maybe you can add a subset called "all" for people who want to load all the data without having to list all the languages?

  - config_name: all
    data_files: data/*.jsonl
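
For context, in the dataset card (README.md) YAML header this entry would sit under the top-level configs: key, roughly like this (a sketch; the existing per-pair configs stay listed alongside it):

configs:
- config_name: all
  data_files: data/*.jsonl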

lhoestq · Apr 15 '24 16:04

Thanks for your reply, it is indeed much faster. However, the result is a dataset where all the subsets are "merged" together and the language pair is lost:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2'],
        num_rows: 247809
    })
})

I guess I could add a 'lang' feature to each row in the dataset. Is there a better way to do it?

loicmagne · Apr 15 '24 18:04

Hi @lhoestq, over at https://github.com/embeddings-benchmark/mteb/issues/530 we have started examining these issues and would love to make a PR for datasets if we believe there is a way to improve the speed. Since I assume you have a better overview than me, would you be interested in a PR, and do you have an idea of where we should start working on it?

We see a speed comparison of:

  1. 15 minutes (for ~20% of the languages) when loaded using a for loop
  2. 17 minutes using your suggestion
  3. ~30 seconds when using @loicmagne's "merged" method

It is worth mentioning that solution 2 loses the language information.

KennethEnevoldsen · Apr 24 '24 07:04

Can you retry using datasets 2.19 ? We improved a lot the speed of downloading datasets with tons of small files.

pip install -U datasets

Now this takes 17 seconds on my side instead of the 17 minutes @loicmagne mentioned :)

>>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
Wall time: 17.4 s

lhoestq · Apr 24 '24 13:04

I was actually just noticing that: I bumped from 2.18 to 2.19 and got a massive speedup, amazing!

About the fact that subset names are lost when loading all files at once: currently my solution is to add a 'lang' feature to each row, convert to polars, and use:

ds_split = ds.to_polars().group_by('lang')

It's fast, so I think it's an acceptable solution, but is there a better way to do it?
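
For reference, a minimal end-to-end sketch of that workflow (it assumes each JSONL row already carries a 'lang' column holding its pair name, and that polars is installed):

from datasets import load_dataset

# Load all the JSONL files as a single merged train split
ds = load_dataset(
    "loicmagne/open-subtitles-250-bitext-mining",
    data_files="data/*.jsonl",
    split="train",
)

# Convert to polars and split the merged data back into one DataFrame per pair
per_pair = {}
for key, frame in ds.to_polars().group_by("lang"):
    # group_by yields (key, DataFrame) pairs; recent polars returns the key as a tuple
    lang = key[0] if isinstance(key, tuple) else key
    per_pair[lang] = frame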

loicmagne · Apr 24 '24 13:04

It's the fastest way I think :)

Alternatively, you can download the dataset repository locally using huggingface_hub (either via the CLI or in Python) and load the subsets one by one with a for loop as you were doing before (just pass the directory path to load_dataset instead of the dataset_id).
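
A minimal sketch of that approach with snapshot_download (the two pairs listed here are just placeholders):

from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download the whole dataset repository once; files are cached locally
local_dir = snapshot_download(
    "loicmagne/open-subtitles-250-bitext-mining",
    repo_type="dataset",
)

# Then load the subsets one by one from the local copy
for subset in ["ka-ml", "br-sr"]:  # any pairs you need
    ds = load_dataset(local_dir, subset)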

lhoestq · Apr 24 '24 13:04