Caching map result of DatasetDict.
Hi!
I'm currently using the map function to tokenize a somewhat large dataset, so I need the cache to avoid ~25 minutes of recomputation.
Changing num_proc causes the map to be recomputed. I'm not sure why, or whether this is expected behavior.
The code here says that cached files are loaded sequentially:
https://github.com/huggingface/datasets/blob/bb2664cf540d5ce4b066365e7c8b26e7f1ca4743/src/datasets/arrow_dataset.py#L3005-L3006
It seems like I can pass in a fingerprint and load the cached result directly:
https://github.com/huggingface/datasets/blob/bb2664cf540d5ce4b066365e7c8b26e7f1ca4743/src/datasets/arrow_dataset.py#L3108-L3125
Environment Setup:
- Python 3.11.9
- datasets 2.19.1 conda-forge
- Linux 6.1.83-1.el9.elrepo.x86_64
MRE
fixed raw_datasets
fixed tokenize_function
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=9,
    remove_columns=['text'],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)
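One possible explanation for the recomputation, sketched under assumptions: with num_proc > 1, each worker appears to write its own cache shard, and the shard file name seems to embed both the worker rank and num_proc. The snippet below is a standalone illustration in plain Python, not the library's code. The helper `shard_cache_names`, the fingerprint `abc123`, and the suffix template are assumptions modeled on the naming in arrow_dataset.py, and may differ between datasets versions.

```python
# Illustrative only: mimics the assumed per-shard cache naming scheme.
# SUFFIX_TEMPLATE and shard_cache_names are hypothetical stand-ins.
SUFFIX_TEMPLATE = "_{rank:05d}_of_{num_proc:05d}"


def shard_cache_names(fingerprint: str, num_proc: int) -> list[str]:
    """Return the cache file names that num_proc workers would look for."""
    return [
        f"cache-{fingerprint}{SUFFIX_TEMPLATE.format(rank=rank, num_proc=num_proc)}.arrow"
        for rank in range(num_proc)
    ]


# Same (made-up) fingerprint, different worker counts.
names_9 = shard_cache_names("abc123", num_proc=9)
names_5 = shard_cache_names("abc123", num_proc=5)

# No file name is shared between the two runs, so a run with
# num_proc=5 would not find the shards written with num_proc=9.
shared = set(names_9) & set(names_5)
print(names_9[0])
print(names_5[0])
print(len(shared))
```

If this naming assumption holds for the installed version, keeping num_proc identical across runs should let the second call hit the cache.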