"Property couldn't be hashed properly" even though fully picklable
Describe the bug
I am trying to tokenize a dataset with spaCy. I found that no matter what I do, the spaCy language object (`nlp`) prevents `datasets` from pickling correctly - or so the warning says - even though manually pickling is no issue. It should not be an issue either, since spaCy objects are picklable.
Steps to reproduce the bug
Here is a colab but for some reason I cannot reproduce it there. That may have to do with logging/tqdm on Colab, or with running things in notebooks. I tried the code below on Windows and Ubuntu as a Python script and got the same issue (warning below).
```python
import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10%]")
        ds = ds.map(self.parse, batched=True, num_proc=6)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled!")

    pr.process()
```
Here is a small change that includes `Hasher.hash`, showing that the hasher cannot seem to successfully pickle parts of the NLP object.
```python
from datasets.fingerprint import Hasher
import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10]")
        return ds.map(self.parse, batched=True)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled class instance!")

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr.nlp, f)
    print("Successfully pickled nlp!")

    # fails
    print(Hasher.hash(pr.nlp))
    pr.process()
```
Expected results
The object to be picklable, the mapping to work (with fingerprinting), and no warning to be shown.
Actual results
In the first snippet, I get this warning:

```
Parameter 'function'=<function Processor.parse at 0x7f44982247a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```
In the second, I get this traceback, which points to the `Hasher.hash` line.
```
Traceback (most recent call last):
File " \Python\Python36\lib\pickle.py", line 918, in save_global
obj2, parent = _getattribute(module, name)
File " \Python\Python36\lib\pickle.py", line 266, in _getattribute
.format(name, obj))
AttributeError: Can't get local attribute 'add_codes.<locals>.ErrorsWithCodes' on <function add_codes at 0x00000296FF606EA0>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File " scratch_4.py", line 40, in <module>
print(Hasher.hash(pr.nlp))
File " \lib\site-packages\datasets\fingerprint.py", line 191, in hash
return cls.hash_default(value)
File " \lib\site-packages\datasets\fingerprint.py", line 184, in hash_default
return cls.hash_bytes(dumps(value))
File " \lib\site-packages\datasets\utils\py_utils.py", line 345, in dumps
dump(obj, file)
File " \lib\site-packages\datasets\utils\py_utils.py", line 320, in dump
Pickler(file, recurse=True).dump(obj)
File " \lib\site-packages\dill\_dill.py", line 498, in dump
StockPickler.dump(self, obj)
File " \Python\Python36\lib\pickle.py", line 409, in dump
self.save(obj)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
save(state)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 781, in save_list
self._batch_appends(obj)
File " \Python\Python36\lib\pickle.py", line 805, in _batch_appends
save(x)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
save(state)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 1176, in save_instancemethod0
pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)
File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
save(args)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\datasets\utils\py_utils.py", line 523, in save_function
obj=obj,
File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
save(args)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 751, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 605, in save_reduce
save(cls)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 1439, in save_type
StockPickler.save_global(pickler, obj, name=name)
File " \Python\Python36\lib\pickle.py", line 922, in save_global
(obj, module_name, name))
_pickle.PicklingError: Can't pickle <class 'spacy.errors.add_codes.<locals>.ErrorsWithCodes'>: it's not found as spacy.errors.add_codes.<locals>.ErrorsWithCodes
```
Environment info
Tried on both Linux and Windows
- `datasets` version: 1.14.0
- Platform: Windows-10-10.0.19041-SP0 + Python 3.7.9; Linux-5.11.0-38-generic-x86_64-with-Ubuntu-20.04-focal + Python 3.7.12
- PyArrow version: 6.0.0
After some digging, I found that this is caused by `dill` using `recurse=True` when trying to dump the object. The problem also occurs without multiprocessing. I can only find the following information about this:

> If recurse=True, then objects referred to in the global dictionary are recursively traced and pickled, instead of the default behavior of attempting to store the entire global dictionary. This is needed for functions defined via exec().
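To make that concrete, here is a minimal, self-contained sketch of that behavior (nothing to do with the `datasets` internals; `fake_nlp` and `parse` are just hypothetical stand-ins for the spaCy object and the map function):

```python
import io

import dill

fake_nlp = {"weights": [1, 2, 3]}  # hypothetical stand-in for the nlp object


def parse(text):
    # References a module-level object, like the map function references the pipeline.
    return text, fake_nlp


buf = io.BytesIO()
# With recurse=True, dill traces and pickles fake_nlp along with the function,
# so anything unpicklable (or unstably picklable) in those globals changes or breaks the dump.
dill.Pickler(buf, recurse=True).dump(parse)
print(len(buf.getvalue()))
```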
In the `datasets` utils, this is explicitly enabled:
https://github.com/huggingface/datasets/blob/df63614223bf1dd1feb267d39d741bada613352c/src/datasets/utils/py_utils.py#L327-L330
Is this really necessary? Is there a way around it? Also pinging the spaCy team in case this is easy to solve on their end. (I hope so.)
Hi ! Thanks for reporting.

Yes, `recurse=True` is necessary to be able to hash all the objects that are passed to the `map` function.

EDIT: hopefully this object can be serializable soon, but otherwise we can consider adding more control to the user on how to hash objects that are not serializable (as mentioned in https://github.com/huggingface/datasets/issues/3044#issuecomment-948818210)
I submitted a PR to spacy that should fix this issue (linked above). I'll leave this open until that PR is merged.
@lhoestq After some testing I find that even with the updated spaCy, no cache files are used. I do not get any warnings though, but I can see that `map` is run every time I run the code. Do you have thoughts about why? If you want to try the tests below, make sure to install spaCy from here and install the base model with `python -m spacy download en_core_web_sm`.
```python
from functools import partial
from pathlib import Path

import spacy
from datasets import Dataset
import datasets

datasets.logging.set_verbosity_debug()


def tokenize(nlp, l):
    return {"tok": [t.text for t in nlp(l["text"])]}


def main():
    fin = r"some/file/with/many/lines"
    lines = Path(fin).read_text(encoding="utf-8").splitlines()
    nlp = spacy.load("en_core_web_sm")
    ds = Dataset.from_dict({"text": lines, "text_id": list(range(len(lines)))})
    tok = partial(tokenize, nlp)
    ds = ds.map(tok, load_from_cache_file=True)
    print(ds[0:2])


if __name__ == '__main__':
    main()
```
... or with `load_dataset` (here I get the message that `load_dataset` can reuse the dataset, but still I see all samples being processed via the tqdm progressbar):
```python
from functools import partial

import spacy
from datasets import load_dataset
import datasets

datasets.logging.set_verbosity_debug()


def tokenize(nlp, sample):
    return {"tok": [t.text for t in nlp(sample["text"])]}


def main():
    fin = r"some/file/with/many/lines"
    nlp = spacy.load("en_core_web_sm")
    tok_func = partial(tokenize, nlp)
    ds = load_dataset('text', data_files=fin)
    ds = ds["train"].map(tok_func)
    print(ds[0:2])


if __name__ == '__main__':
    main()
```
It looks like every time you load `en_core_web_sm` you get a different python object:

```python
import spacy
from datasets.fingerprint import Hasher

nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_sm")
Hasher.hash(nlp1), Hasher.hash(nlp2)
# ('f6196a33882fea3b', 'a4c676a071f266ff')
```
Here is a list of attributes that have different hashes for `nlp1` and `nlp2`:
- tagger
- parser
- entity
- pipeline (it's the list of the three attributes above)
I just took a look at the tagger for example and I found subtle differences (there may be other differences though):
```python
nlp1.tagger.model.tok2vec.embed.id, nlp2.tagger.model.tok2vec.embed.id
# (1721, 2243)
```
We can try to find all the differences and find the best way to hash those objects properly
Thanks for searching! I went looking, and found that this is an implementation detail of thinc:

https://github.com/explosion/thinc/blob/68691e303ae68cae4bc803299016f1fc064328bf/thinc/model.py#L96-L98

Presumably (?) exactly to distinguish between different parts in memory when multiple models are loaded. I do not think that this can be changed on their end - but I will ask what exactly it is for (I'm curious).
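For intuition, here is a toy illustration (not thinc's actual code) of why such an id makes two loads of the same model hash differently: the id comes from a process-global counter, so every instantiation gets a new one even if the weights are identical.

```python
import itertools

_id_counter = itertools.count(1)  # process-global counter, like in the linked thinc code


class FakeModel:
    """Toy stand-in for a thinc Model: identical contents, but a fresh id each time."""

    def __init__(self):
        self.id = next(_id_counter)


print(FakeModel().id, FakeModel().id)  # e.g. 1 2 -> same "model", different ids, different hashes
```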
Do you think it is overkill to write something into the hasher explicitly to deal with spaCy models? It seems like something that would be beneficial to many, but I do not know if you are open to adding third-party-specific ways to deal with this. If you are, I can have a look at how we can ignore `thinc.Model.id` in the hasher for this specific case.
It can be even simpler to hash the bytes of the pipeline instead:

```python
nlp1.to_bytes() == nlp2.to_bytes()  # True
```
IMO we should integrate the custom hashing for spaCy models into `datasets` (we use a custom Pickler for that).

What could be done on spaCy's side instead (if they think it's nice to have) is to implement custom pickling for these classes using `to_bytes`/`from_bytes` to have deterministic pickle dumps.

Finally I think it would be nice in the future to add an API to let `datasets` users control this kind of thing. Something like being able to define your own hashing if you use complex objects:

```python
@datasets.register_hash(spacy.language.Language)
def hash_spacy_language(nlp):
    return Hasher.hash(nlp.to_bytes())
```
I do not quite understand what you mean. As far as I can tell, using `to_bytes` does a pickle dump behind the scenes (with `srsly`), recursively using `to_bytes` on the required objects. Therefore, the result of `to_bytes` is a deterministic pickle dump AFAICT. Or do you mean that you wish that using your own pickler and running `dumps(nlp)` should also be deterministic? I guess that would require `__setstate__` and `__getstate__` methods on all the objects that have to/from_bytes. I'll ask over at spaCy what they think, and whether that would solve the issue. I'll try this locally first, if I find the time.
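For reference, a minimal sketch of that idea (purely illustrative, not actual spaCy code; `MyComponent` is a hypothetical class that already knows how to serialize itself):

```python
class MyComponent:
    """Hypothetical component with to_bytes/from_bytes, made deterministically picklable."""

    def __init__(self, data: bytes = b""):
        self._data = data

    def to_bytes(self) -> bytes:
        return self._data

    def from_bytes(self, data: bytes) -> "MyComponent":
        self._data = data
        return self

    def __getstate__(self):
        # Only the serialized bytes end up in the pickle, so the dump does not
        # depend on run-specific attributes such as object ids.
        return {"bytes_data": self.to_bytes()}

    def __setstate__(self, state):
        self.__init__()
        self.from_bytes(state["bytes_data"])
```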
I agree that having the option to use a custom hasher would be useful. I like your suggestion!
EDIT: after trying some things and reading through their API, it seems that they explicitly do not want this: https://spacy.io/usage/saving-loading#pipeline

> When serializing the pipeline, keep in mind that this will only save out the binary data for the individual components to allow spaCy to restore them – not the entire objects. This is a good thing, because it makes serialization safe. But it also means that you have to take care of storing the config, which contains the pipeline configuration and all the relevant settings.
The best way forward therefore seems to be to implement the ability to specify a hasher depending on the objects that are pickled, as you suggested. I can work on this if that is useful. I could use some pointers as to how you would like to implement the `register_hash` functionality though. I assume using `catalogue` over at Explosion might be a good starting point.
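In case it helps the discussion, here is a rough, dependency-free sketch (not the eventual `datasets` API; all names here are made up) of what a type-based hasher registry could look like:

```python
from typing import Callable, Dict, Optional, Type


class HasherRegistry:
    """Maps a type to a function that produces a deterministic hash for its instances."""

    def __init__(self) -> None:
        self._hashers: Dict[Type, Callable] = {}

    def register(self, cls: Type) -> Callable:
        def decorator(func: Callable) -> Callable:
            self._hashers[cls] = func
            return func
        return decorator

    def get(self, obj) -> Optional[Callable]:
        # First registered match wins; isinstance also covers subclasses.
        for cls, func in self._hashers.items():
            if isinstance(obj, cls):
                return func
        return None


hashers = HasherRegistry()

# Usage would then mirror the decorator proposed above:
# @hashers.register(spacy.language.Language)
# def hash_spacy_language(nlp):
#     return Hasher.hash(nlp.to_bytes())
```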
Interestingly, my PR does not solve the issue discussed above. The `tokenize` function hash is different on every run, because for some reason `nlp.__call__` has a different hash every time. The issue therefore seems to run much deeper than I thought. If you have any ideas, I'm all ears.
```bash
git clone https://github.com/explosion/spaCy.git
cd spaCy/
git checkout cab9209c3dfcd1b75dfe5657f10e52c4d847a3cf
cd ..
git clone https://github.com/BramVanroy/datasets.git
cd datasets
git checkout registry
pip install -e .
pip install ../spaCy
spacy download en_core_web_sm
```
```python
import spacy
from datasets import load_dataset
from datasets.fingerprint import Hasher
from datasets.utils.registry import hashers


@hashers.register(spacy.Language)
def hash_spacy_language(nlp):
    return Hasher.hash(nlp.to_bytes())


def main():
    fin = r"your/large/file"
    nlp = spacy.load("en_core_web_sm")
    # This is now always the same yay!
    print(Hasher.hash(nlp))

    def tokenize(l):
        return {"tok": [t.text for t in nlp(l["text"])]}

    ds = load_dataset("text", data_files=fin)
    # But this is not...
    print(Hasher.hash(tokenize))
    # ... because of this
    print(Hasher.hash(nlp.__call__))
    ds = ds["train"].map(tokenize)
    print(ds[0:2])


if __name__ == '__main__':
    main()
```
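My reading of what happens there (a guess from the pickling machinery, not a confirmed diagnosis): a bound method drags its instance along when pickled, so hashing `nlp.__call__` re-pickles the whole `nlp` object through the default path rather than the hasher registered for `Language` at the top level.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# nlp.__call__ is a bound method: pickling it means pickling (function, instance),
# so the run-specific state inside nlp still ends up in the dump.
print(nlp.__call__.__self__ is nlp)  # True
```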
Hi ! I just answered in your PR :) In order for your custom hashing to be used for nested objects, you must integrate it into our recursive pickler that we use for hashing.
I don't quite understand the design constraints of `datasets` or the script that you're running, but my usual advice is to avoid using pickle unless you absolutely have to. So for instance instead of doing your `partial` over the `nlp` object itself, can you just pass the string `en_core_web_sm` in? This will mean calling `spacy.load()` inside the work function, but this is no worse than having to call `pickle.load()` on the contents of the NLP object anyway -- in fact you'll generally find `spacy.load()` faster, apart from the disk read.
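A minimal sketch of that suggestion, assuming loading inside the mapped function is acceptable (`tokenize_batch` and the file path are made-up names; reloading per batch trades speed for a trivially hashable function):

```python
import spacy
from datasets import load_dataset


def tokenize_batch(batch, model_name="en_core_web_sm"):
    # Load from the model name inside the worker: only the string needs to be
    # hashed/pickled, not the live pipeline object.
    nlp = spacy.load(model_name, disable=["tagger", "parser", "ner", "lemmatizer"])
    return {"tok": [[t.text for t in doc] for doc in nlp.pipe(batch["text"])]}


ds = load_dataset("text", data_files="some/file/with/many/lines")["train"]
ds = ds.map(tokenize_batch, batched=True)
```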
If you need to pass in the bytes data and don't want to read from disk, you could do something like this:

```python
msg = (nlp.lang, nlp.to_bytes())

def unpack(lang, bytes_data):
    return spacy.blank(lang).from_bytes(bytes_data)
```
I think that should probably work: the Thinc `model.to_dict()` method (which is used by the `model.to_bytes()` method) doesn't pack the model's ID into the message, so the `nlp.to_bytes()` that you get shouldn't be affected by the global IDs. So you should get a clean message from `nlp.to_bytes()` that doesn't depend on the global state.
Hi Matthew, thanks for chiming in! We are currently implementing exactly what you suggest: `to_bytes()` as a default before pickling - but we may prefer `to_dict` to avoid double dumping.

`datasets` uses pickle dumps (actually dill) to get unique representations of processing steps (a "fingerprint" or hash). So it never needs to re-load that dump - it just needs its value to create a hash. If a fingerprint is identical to a cached fingerprint, then the result can be retrieved from the on-disk cache. (@lhoestq or @mariosasko can correct me if I'm wrong.)

I was experiencing the issue that parsing with spaCy gave me a different fingerprint on every run of the script, and thus it could never load the processed dataset from cache. At first I thought the reason was that spaCy Language objects were not picklable with recursive dill, but even after adjusting for that the issue persisted. @lhoestq found that this is due to the changing `id`, which you discussed here. So yes, you are right. On the surface there simply seems to be an incompatibility between `datasets`' default caching functionality as it is currently implemented and `spacy.Language`.
The linked PR aims to remedy that, though. Up to now I have put some effort into making it easier to define your own "pickling" function for a given type (and optionally any of its subclasses). That allows us to tell `datasets` to use `dill.save(nlp.to_bytes())` (deterministic) instead of `dill.save(nlp)` (non-deterministic). When I find some more time, the PR will be expanded to improve the user experience a bit and add a built-in function to pickle `spacy.Language` as one of the defaults (using `to_bytes()`).
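To illustrate the general "serialize the bytes, not the object graph" idea with standard-library machinery only (this is not the PR's code, and whether the `datasets` pickler consults `copyreg` is a separate question; the reducer simply reuses the `lang`/`to_bytes`/`from_bytes` round-trip suggested above):

```python
import copyreg

import spacy


def rebuild_language(lang: str, data: bytes):
    # Reconstruct a pipeline from its language code and serialized bytes.
    return spacy.blank(lang).from_bytes(data)


def reduce_language(nlp):
    # Pickle the (lang, bytes) pair instead of the live object graph,
    # so the dump no longer depends on run-specific ids.
    return (rebuild_language, (nlp.lang, nlp.to_bytes()))


copyreg.pickle(spacy.language.Language, reduce_language)
```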
Is there a workaround for this? Maybe by explicitly requesting `datasets` to cache the result of `.map()`?
Hi ! If your function is not picklable, then the fingerprint of the resulting dataset can't be computed. The fingerprint is a hash that is used by the cache to reload previously computed datasets: the dataset file is named `cache-<fingerprint>.arrow` in your dataset's cache directory.

As a workaround you can set the fingerprint that is going to be used by the cache:

```python
result = my_dataset.map(func, new_fingerprint=new_fingerprint)
```

Any future call to `map` with the same `new_fingerprint` will reload the result from the cache.

Be careful using this though: if you change your `func`, be sure to change the `new_fingerprint` as well.
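For the spaCy case discussed above, one way to make that fingerprint deterministic yourself is to derive it from something that is identical on every run. A sketch, assuming `nlp` and `tok_func` from the earlier snippets and relying on the observation that `nlp.to_bytes()` is stable across runs ("tokenize-v1" is an arbitrary label you bump when the function changes):

```python
from datasets.fingerprint import Hasher

fingerprint = Hasher.hash(("tokenize-v1", nlp.to_bytes()))  # deterministic across runs
ds = ds.map(tok_func, new_fingerprint=fingerprint)
```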
I've been having an issue that might be related to this when trying to pre-tokenize a corpus and caching it for use later in the pre-training of a RoBERTa model. I always get the following warning:

```
Dataset text downloaded and prepared to /gpfswork/rech/project/user/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data.

Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform [email protected] couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```

And when I launch the pre-training, the pre-tokenized corpus is not found and it is tokenized again, which makes me waste precious GPU hours.
For me, the workaround was downgrading `dill` and `multiprocess` to the following versions:

```
dill 0.3.4
multiprocess 0.70.12.2
```
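If it helps, a quick way to confirm which versions are actually active in your environment (both packages expose `__version__`):

```python
import dill
import multiprocess

print(dill.__version__)          # expected: 0.3.4
print(multiprocess.__version__)  # expected: 0.70.12.2
```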
> Hi ! If your function is not picklable, then the fingerprint of the resulting dataset can't be computed. The fingerprint is a hash that is used by the cache to reload previously computed datasets: the dataset file is named `cache-<fingerprint>.arrow` in your dataset's cache directory.
>
> As a workaround you can set the fingerprint that is going to be used by the cache:
>
> ```python
> result = my_dataset.map(func, new_fingerprint=new_fingerprint)
> ```
>
> Any future call to `map` with the same `new_fingerprint` will reload the result from the cache.
>
> Be careful using this though: if you change your `func`, be sure to change the `new_fingerprint` as well.
Is the argument `new_fingerprint` available for `DatasetDict`? I can only use it on Arrow datasets, but it might be useful to generalize it to `DatasetDict`'s `map` as well? @lhoestq
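In the meantime, a per-split workaround sketch (not a built-in `datasets` feature; it just applies the existing `Dataset.map` argument to each split, assuming `ds` is a `DatasetDict` and `tok_func`/`nlp` come from the earlier snippets):

```python
from datasets import DatasetDict
from datasets.fingerprint import Hasher

tokenized = DatasetDict({
    split: dset.map(tok_func, new_fingerprint=Hasher.hash((split, nlp.to_bytes())))
    for split, dset in ds.items()
})
```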
> I've been having an issue that might be related to this when trying to pre-tokenize a corpus and caching it for use later in the pre-training of a RoBERTa model. I always get the following warning:
>
> Dataset text downloaded and prepared to /gpfswork/rech/project/user/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data. Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform [email protected] couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
>
> And when I launch the pre-training the pre-tokenized corpus is not found and it is tokenized again, which makes me waste precious GPU hours.
>
> For me, the workaround was downgrading `dill` and `multiprocess` to the following versions:
>
> ```
> dill 0.3.4
> multiprocess 0.70.12.2
> ```
This worked for me - thanks!
I see this has just been closed - it seems quite relevant to another tokenizer I have been trying to use, the `vinai/phobert` family of tokenizers:

https://huggingface.co/vinai/phobert-base
https://huggingface.co/vinai/phobert-large
I ran into an issue where a large dataset took several hours to tokenize, the process hung, and I was unable to use the cached version of the tokenized data:
https://discuss.huggingface.co/t/cache-parallelize-long-tokenization-step/25791/3
I don't see any way to specify the hash of the tokenizer or the fingerprint of the tokenized data to use, so is the tokenized dataset basically lost at this point? Is there a good way to avoid this happening again if I retokenize the data?
In your case it looks like the job failed before caching the data - maybe one of the processes crashed
Interesting. Thanks for the observation. Any suggestions on how to start tracking that down? Perhaps run it single-threaded and see if it crashes?
You can monitor your RAM and disk space in case a process dies from OOM or a full disk, and when it hangs you can check how many processes are running. IIRC there are other start methods for multiprocessing in Python that may show an error message if a process dies.

Running on a single process can indeed also help with debugging this.
https://github.com/huggingface/datasets/issues/3178#issuecomment-1189435462
The solution does not work for the Common Voice dataset ("mozilla-foundation/common_voice_11_0").
Hi @tung-msol, could you open a new issue and share the error you got and the `map` function you used?