datasets
Cannot load the cache when mapping the dataset
Describe the bug
I'm training the flux controlnet. `train_dataset.map()` takes a long time to finish. However, when I kill the training process and restart training with the same dataset, I can't reuse the mapped result, even though I defined a cache dir for the dataset.
```python
from datasets.fingerprint import Hasher

with accelerator.main_process_first():
    # fingerprint used by the cache for the other processes to load the result
    # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
    new_fingerprint = Hasher.hash(args)
    train_dataset = train_dataset.map(
        compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint, batch_size=10,
    )
```
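One plausible explanation (an assumption, not confirmed from the snippet above): `Hasher.hash(args)` fingerprints the *entire* args namespace, so any field that changes between runs (an output directory with a timestamp, a resumed checkpoint path, etc.) produces a different fingerprint, and `datasets` then looks for a cache file that doesn't exist. The stdlib sketch below illustrates the mechanism with hypothetical field names; hashing only the fields that actually affect the mapped output keeps the fingerprint stable across runs:

```python
import hashlib
import json

def stable_fingerprint(params: dict) -> str:
    """Deterministic fingerprint from a JSON-serializable dict of parameters."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

# Two runs that share the same preprocessing config but differ in
# run-specific metadata (hypothetical example fields):
run1 = {"resolution": 512, "batch_size": 10, "output_dir": "runs/2024-01-01"}
run2 = {"resolution": 512, "batch_size": 10, "output_dir": "runs/2024-01-02"}

# Hashing ALL args yields different fingerprints, so the second run
# cannot find the first run's cache file.
assert stable_fingerprint(run1) != stable_fingerprint(run2)

# Hashing only the fields that influence the mapped output is stable.
relevant = ("resolution", "batch_size")
fp1 = stable_fingerprint({k: run1[k] for k in relevant})
fp2 = stable_fingerprint({k: run2[k] for k in relevant})
assert fp1 == fp2
```

Alternatively, `Dataset.map()` accepts a `cache_file_name` argument, so pinning the cache to a fixed path sidesteps fingerprinting entirely, at the cost of having to delete the file yourself when the preprocessing actually changes.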
Steps to reproduce the bug
Train the flux controlnet, interrupt the run, then start training again with the same dataset.
Expected behavior
The second run should load the cached map result instead of mapping again.
Environment info
latest diffusers
@zhangn77 Hi, have you solved this problem? I encountered the same issue during training. Could we discuss it?
I also encountered the same problem. Why does this happen?