
Caching map result of DatasetDict.

Open MostHumble opened this issue 9 months ago • 0 comments

Hi!

I'm currently using the map function to tokenize a fairly large dataset, so I rely on the cache to save ~25 minutes per run.

Changing num_proc triggers a full recomputation of the map. I'm not sure why this happens, or whether it is expected behavior.

Here the code says that cached files are loaded sequentially:

https://github.com/huggingface/datasets/blob/bb2664cf540d5ce4b066365e7c8b26e7f1ca4743/src/datasets/arrow_dataset.py#L3005-L3006

It seems like I could pass in a fingerprint and load the cached result directly:

https://github.com/huggingface/datasets/blob/bb2664cf540d5ce4b066365e7c8b26e7f1ca4743/src/datasets/arrow_dataset.py#L3108-L3125
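My current understanding, as an illustrative stdlib-only sketch rather than the library's actual code: with num_proc > 1 the dataset is split into one shard per process, and each shard gets its own cache file whose name embeds both the shard rank and num_proc (roughly following a suffix template like `_{rank:05d}_of_{num_proc:05d}`). If that is right, the files written by a num_proc=9 run can never match the names a num_proc=5 run looks up, so nothing is reused:

```python
import hashlib

def cache_file_names(fingerprint: str, num_proc: int) -> list[str]:
    # Illustrative naming scheme only: one "cache-<fingerprint>.arrow" file
    # for a single process, or one file per shard with a rank/num_proc suffix
    # when multiprocessing is used.
    if num_proc == 1:
        return [f"cache-{fingerprint}.arrow"]
    return [
        f"cache-{fingerprint}_{rank:05d}_of_{num_proc:05d}.arrow"
        for rank in range(num_proc)
    ]

# Stand-in fingerprint; the real one hashes the dataset state plus the
# map arguments (function, batched, remove_columns, ...).
fp = hashlib.md5(b"tokenize_function|batched=True").hexdigest()

files_9 = set(cache_file_names(fp, num_proc=9))
files_5 = set(cache_file_names(fp, num_proc=5))

# No shard file is shared between the two runs, so the second map call
# finds nothing to load and recomputes everything.
print(files_9 & files_5)  # -> set()
```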

Environment Setup:

  • Python 3.11.9
  • datasets 2.19.1 (conda-forge)
  • Linux 6.1.83-1.el9.elrepo.x86_64

MRE

raw_datasets and tokenize_function are fixed (identical) across both calls:

# first call: computes and caches the tokenized dataset
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=9,
    remove_columns=["text"],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)


# second call: only num_proc changed, yet the map is recomputed from scratch
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=["text"],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)

MostHumble avatar May 28 '24 09:05 MostHumble