Quentin Lhoest
For the caching maybe we can have `Dataset.from_generator` like TF, and pickle+hash the generator function (not the generator object itself)? And then keep `Dataset.from_iterable` for picklable objects like lists
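A minimal sketch of what such an API could look like, assuming a `Dataset.from_generator` that pickles and hashes the generator function (not the generator object) for caching:

```python
from datasets import Dataset

def gen():
    # the generator *function* is what gets pickled and hashed for the cache,
    # not the generator object it returns when called
    for i in range(3):
        yield {"id": i, "text": f"example {i}"}

# hypothetical usage matching the proposal above
ds = Dataset.from_generator(gen)
print(ds[0])
```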
You can just use `Dataset.from_file` to get your dataset, no need to do an extra `save_to_disk` somewhere else ;)
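For reference, a minimal sketch of loading an Arrow file directly with `Dataset.from_file` (the path is illustrative):

```python
from datasets import Dataset

# memory-map an existing Arrow file written by `datasets` (illustrative path)
ds = Dataset.from_file("path/to/dataset.arrow")
print(ds)
```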
`save_to_disk`/`load_from_disk` is indeed more general, e.g. it supports datasets that consist of several files, and it saves some extra info in a dataset_info.json file (description, citation, split sizes, etc.). If you...
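A short sketch of that save/reload round trip (the dataset name and path are illustrative):

```python
from datasets import load_dataset, load_from_disk

ds = load_dataset("imdb", split="train")  # illustrative dataset
ds.save_to_disk("my_dataset")             # writes the Arrow data plus dataset_info.json
reloaded = load_from_disk("my_dataset")
print(reloaded)
```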
This is related to https://github.com/huggingface/datasets/issues/3547
Thanks for opening this issue :) If it can help, I think you can already use `huggingface_hub` to achieve this: ```python >>> from huggingface_hub import HfApi >>> [ds_info.id for ds_info...
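A fuller sketch of that `huggingface_hub` approach (the comment above is truncated; the `author` filter here is only illustrative, not taken from the original):

```python
from huggingface_hub import HfApi

api = HfApi()
# list dataset repositories on the Hub and keep their ids
dataset_ids = [ds_info.id for ds_info in api.list_datasets(author="huggingface")]
print(dataset_ids[:5])
```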
Hi ! Thanks for reporting. Yes, `recurse=True` is necessary to be able to hash all the objects that are passed to the `map` function. EDIT: hopefully this object can be...
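If this refers to dill's `recurse` setting (an assumption on my part), the idea is roughly:

```python
import dill

LOOKUP = {"prefix": ">> "}  # a global the mapped function depends on

def add_prefix(example):
    return {"text": LOOKUP["prefix"] + example["text"]}

# recurse=True makes dill also serialize the global objects the function refers to,
# so the resulting bytes (and therefore the hash) cover everything `map` depends on
payload = dill.dumps(add_prefix, recurse=True)
```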
It looks like every time you load `en_core_web_sm` you get a different python object:

```python
import spacy
from datasets.fingerprint import Hasher

nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_sm")
Hasher.hash(nlp1), Hasher.hash(nlp2)
# ...
```
It can be even simpler to hash the bytes of the pipeline instead:

```python
nlp1.to_bytes() == nlp2.to_bytes()  # True
```

IMO we should integrate the custom hashing for spacy models...
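A minimal sketch of that idea, hashing the serialized bytes rather than the `Language` objects themselves:

```python
import spacy
from datasets.fingerprint import Hasher

nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_sm")

# the serialized bytes are identical across loads, so their hash is stable,
# unlike the hashes of the two distinct Language objects
print(Hasher.hash(nlp1.to_bytes()) == Hasher.hash(nlp2.to_bytes()))  # True
```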
Hi ! I just answered in your PR :) In order for your custom hashing to be used for nested objects, you must integrate it into our recursive pickler that...
Hi ! If your function is not picklable, then the fingerprint of the resulting dataset can't be computed. The fingerprint is a hash that is used by the cache to...
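To make the role of the fingerprint concrete, here is a small sketch of hashing a mapped function the way the cache does (the function is illustrative):

```python
from datasets.fingerprint import Hasher

def add_prefix(example):
    return {"text": "prefix: " + example["text"]}

# the new dataset fingerprint is derived from hashes like this one;
# if the function can't be pickled, this step fails and the cache can't reuse results
print(Hasher.hash(add_prefix))
```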