
Cannot load the cache when mapping the dataset

zhangn77 opened this issue

Describe the bug

I'm training the Flux ControlNet. The train_dataset.map() call takes a long time to finish. However, when I kill the training process and want to restart a new training run with the same dataset, I cannot reuse the mapped result, even though I defined the cache dir for the dataset.

    with accelerator.main_process_first():
        from datasets.fingerprint import Hasher

        # fingerprint used by the cache for the other processes to load the result
        # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
        new_fingerprint = Hasher.hash(args)
        train_dataset = train_dataset.map(
            compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint, batch_size=10,
        )
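
One workaround to try (a minimal sketch, not the original training script): pin the map output to an explicit Arrow file via cache_file_name, so a restarted run can load it regardless of how the fingerprint is computed. compute_embeddings_fn, args, accelerator, and train_dataset come from the script above; args.cache_dir is a hypothetical argument standing in for whatever cache directory the script actually uses.

    import os
    from datasets.fingerprint import Hasher

    # Hypothetical stable location for the mapped embeddings; reuse the same
    # path on every run so map() can load it instead of recomputing.
    cache_file = os.path.join(args.cache_dir, "embeddings_cache.arrow")

    with accelerator.main_process_first():
        new_fingerprint = Hasher.hash(args)  # must be identical across runs for a cache hit
        train_dataset = train_dataset.map(
            compute_embeddings_fn,
            batched=True,
            batch_size=10,
            new_fingerprint=new_fingerprint,
            cache_file_name=cache_file,   # write/read the cache at a fixed path
            load_from_cache_file=True,    # the default, shown explicitly
        )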

Steps to reproduce the bug

Train the Flux ControlNet, kill the training process, and then start training again with the same dataset and cache dir.

Expected behavior

The mapped result should be loaded from the cache on restart instead of being recomputed.
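
A hedged guess at the cause, with a small diagnostic (not from the original report): the cache lookup depends on the dataset fingerprint and the new_fingerprint passed to map(), so if Hasher.hash(args) changes between runs (for example, because args contains a timestamped output directory or a resume checkpoint path), the cache file written by the first run is never found. Printing both values in each run shows whether that is happening:

    from datasets.fingerprint import Hasher

    # Print these in both runs; if either value differs, the cached Arrow file
    # from the first run will not be picked up on restart.
    print("args fingerprint:   ", Hasher.hash(args))
    print("dataset fingerprint:", train_dataset._fingerprint)  # private attribute, debugging only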

Environment info

latest diffusers

zhangn77 avatar Oct 29 '24 08:10 zhangn77

@zhangn77 Hi, have you solved this problem? I encountered the same issue during training. Could we discuss it?

shawn-AI-Tech avatar Feb 24 '25 07:02 shawn-AI-Tech

I also encountered the same problem. Why does this happen?

MtYCNN avatar Mar 24 '25 13:03 MtYCNN