"Property couldn't be hashed properly" even though fully picklable
Describe the bug
I am trying to tokenize a dataset with spaCy. I found that no matter what I do, the spaCy language object (`nlp`) prevents `datasets` from pickling correctly - or so the warning says - even though manually pickling is no issue. It should not be an issue either, since spaCy objects are picklable.
Steps to reproduce the bug
Here is a colab but for some reason I cannot reproduce it there. That may have to do with logging/tqdm on Colab, or with running things in notebooks. I tried the code below on Windows and Ubuntu as a Python script and got the same issue (warning below).
```python
import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10%]")
        ds = ds.map(self.parse, batched=True, num_proc=6)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled!")

    pr.process()
```
Here is a small change that includes `Hasher.hash`, showing that the hasher cannot seem to successfully pickle parts of the NLP object.
```python
from datasets.fingerprint import Hasher
import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10]")
        return ds.map(self.parse, batched=True)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled class instance!")

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr.nlp, f)
    print("Successfully pickled nlp!")

    # fails
    print(Hasher.hash(pr.nlp))
    pr.process()
```
Expected results
The object to be picklable, the mapping to work (with fingerprinting), and no warning to be shown.
Actual results
In the first snippet, I get this warning:

```
Parameter 'function'=<function Processor.parse at 0x7f44982247a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```
In the second, I get this traceback, which points to the `Hasher.hash` line.
```
Traceback (most recent call last):
File " \Python\Python36\lib\pickle.py", line 918, in save_global
obj2, parent = _getattribute(module, name)
File " \Python\Python36\lib\pickle.py", line 266, in _getattribute
.format(name, obj))
AttributeError: Can't get local attribute 'add_codes.<locals>.ErrorsWithCodes' on <function add_codes at 0x00000296FF606EA0>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File " scratch_4.py", line 40, in <module>
print(Hasher.hash(pr.nlp))
File " \lib\site-packages\datasets\fingerprint.py", line 191, in hash
return cls.hash_default(value)
File " \lib\site-packages\datasets\fingerprint.py", line 184, in hash_default
return cls.hash_bytes(dumps(value))
File " \lib\site-packages\datasets\utils\py_utils.py", line 345, in dumps
dump(obj, file)
File " \lib\site-packages\datasets\utils\py_utils.py", line 320, in dump
Pickler(file, recurse=True).dump(obj)
File " \lib\site-packages\dill\_dill.py", line 498, in dump
StockPickler.dump(self, obj)
File " \Python\Python36\lib\pickle.py", line 409, in dump
self.save(obj)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
save(state)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 781, in save_list
self._batch_appends(obj)
File " \Python\Python36\lib\pickle.py", line 805, in _batch_appends
save(x)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
save(state)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 1176, in save_instancemethod0
pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)
File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
save(args)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\datasets\utils\py_utils.py", line 523, in save_function
obj=obj,
File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
save(args)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 751, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 605, in save_reduce
save(cls)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 1439, in save_type
StockPickler.save_global(pickler, obj, name=name)
File " \Python\Python36\lib\pickle.py", line 922, in save_global
(obj, module_name, name))
_pickle.PicklingError: Can't pickle <class 'spacy.errors.add_codes.<locals>.ErrorsWithCodes'>: it's not found as spacy.errors.add_codes.<locals>.ErrorsWithCodes
```
Environment info
Tried on both Linux and Windows
- `datasets` version: 1.14.0
- Platform: Windows-10-10.0.19041-SP0 + Python 3.7.9; Linux-5.11.0-38-generic-x86_64-with-Ubuntu-20.04-focal + Python 3.7.12
- PyArrow version: 6.0.0
After some digging, I found that this is caused by `dill` using `recurse=True` when trying to dump the object. The problem also occurs without multiprocessing. I can only find the following information about this:

> If recurse=True, then objects referred to in the global dictionary are recursively traced and pickled, instead of the default behavior of attempting to store the entire global dictionary. This is needed for functions defined via exec().
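To make that concrete, here is a minimal, self-contained sketch of that behavior (nothing to do with the `datasets` internals; `fake_nlp` and `parse` are just hypothetical stand-ins for the spaCy object and the map function):

```python
import io

import dill

fake_nlp = {"weights": [1, 2, 3]}  # hypothetical stand-in for the nlp object


def parse(text):
    # References a module-level object, like the map function references the pipeline.
    return text, fake_nlp


buf = io.BytesIO()
# With recurse=True, dill traces and pickles fake_nlp along with the function,
# so anything unpicklable (or unstably picklable) in those globals changes or breaks the dump.
dill.Pickler(buf, recurse=True).dump(parse)
print(len(buf.getvalue()))
```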
In the `datasets` utils, this is explicitly enabled:
https://github.com/huggingface/datasets/blob/df63614223bf1dd1feb267d39d741bada613352c/src/datasets/utils/py_utils.py#L327-L330
Is this really necessary? Is there a way around it? Also pinging the spaCy team in case this is easy to solve on their end. (I hope so.)
Hi ! Thanks for reporting.

Yes, `recurse=True` is necessary to be able to hash all the objects that are passed to the `map` function.

EDIT: hopefully this object can be serializable soon, but otherwise we can consider adding more control to the user on how to hash objects that are not serializable (as mentioned in https://github.com/huggingface/datasets/issues/3044#issuecomment-948818210)
I submitted a PR to spacy that should fix this issue (linked above). I'll leave this open until that PR is merged.
@lhoestq After some testing I find that even with the updated spaCy, no cache files are used. I do not get any warnings though, but I can see that `map` is run every time I run the code. Do you have thoughts about why? If you want to try the tests below, make sure to install spaCy from here and install the base model with `python -m spacy download en_core_web_sm`.
```python
from functools import partial
from pathlib import Path

import spacy
from datasets import Dataset
import datasets

datasets.logging.set_verbosity_debug()


def tokenize(nlp, l):
    return {"tok": [t.text for t in nlp(l["text"])]}


def main():
    fin = r"some/file/with/many/lines"
    lines = Path(fin).read_text(encoding="utf-8").splitlines()
    nlp = spacy.load("en_core_web_sm")
    ds = Dataset.from_dict({"text": lines, "text_id": list(range(len(lines)))})
    tok = partial(tokenize, nlp)
    ds = ds.map(tok, load_from_cache_file=True)
    print(ds[0:2])


if __name__ == '__main__':
    main()
```
... or with `load_dataset` (here I get the message that `load_dataset` can reuse the dataset, but still I see all samples being processed via the tqdm progressbar):
```python
from functools import partial

import spacy
from datasets import load_dataset
import datasets

datasets.logging.set_verbosity_debug()


def tokenize(nlp, sample):
    return {"tok": [t.text for t in nlp(sample["text"])]}


def main():
    fin = r"some/file/with/many/lines"
    nlp = spacy.load("en_core_web_sm")
    tok_func = partial(tokenize, nlp)
    ds = load_dataset('text', data_files=fin)
    ds = ds["train"].map(tok_func)
    print(ds[0:2])


if __name__ == '__main__':
    main()
```
It looks like every time you load `en_core_web_sm` you get a different python object:

```python
import spacy
from datasets.fingerprint import Hasher

nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_sm")
Hasher.hash(nlp1), Hasher.hash(nlp2)
# ('f6196a33882fea3b', 'a4c676a071f266ff')
```
Here is a list of attributes that have different hashes for `nlp1` and `nlp2`:
- tagger
- parser
- entity
- pipeline (it's the list of the three attributes above)
I just took a look at the tagger for example and I found subtle differences (there may be other differences though):
```python
nlp1.tagger.model.tok2vec.embed.id, nlp2.tagger.model.tok2vec.embed.id
# (1721, 2243)
```
We can try to find all the differences and find the best way to hash those objects properly
Thanks for searching! I went looking, and found that this is an implementation detail of thinc:

https://github.com/explosion/thinc/blob/68691e303ae68cae4bc803299016f1fc064328bf/thinc/model.py#L96-L98

Presumably (?) exactly to distinguish between different parts in memory when multiple models are loaded. I do not think that this can be changed on their end - but I will ask what exactly it is for (I'm curious).
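For intuition, here is a toy illustration (not thinc's actual code) of why such an id makes two loads of the same model hash differently: the id comes from a process-global counter, so every instantiation gets a new one even if the weights are identical.

```python
import itertools

_id_counter = itertools.count(1)  # process-global counter, like in the linked thinc code


class FakeModel:
    """Toy stand-in for a thinc Model: identical contents, but a fresh id each time."""

    def __init__(self):
        self.id = next(_id_counter)


print(FakeModel().id, FakeModel().id)  # e.g. 1 2 -> same "model", different ids, different hashes
```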
Do you think it is overkill to write something into the hasher explicitly to deal with spaCy models? It seems like something that would be beneficial to many, but I do not know if you are open to adding third-party-specific ways to deal with this. If you are, I can have a look at how we can ignore `thinc.Model.id` in the hasher for this specific case.
It can be even simpler to hash the bytes of the pipeline instead:

```python
nlp1.to_bytes() == nlp2.to_bytes()  # True
```
IMO we should integrate the custom hashing for spaCy models into `datasets` (we use a custom Pickler for that).

What could be done on spaCy's side instead (if they think it's nice to have) is to implement custom pickling for these classes using `to_bytes`/`from_bytes` to have deterministic pickle dumps.

Finally I think it would be nice in the future to add an API to let `datasets` users control this kind of thing. Something like being able to define your own hashing if you use complex objects:

```python
@datasets.register_hash(spacy.language.Language)
def hash_spacy_language(nlp):
    return Hasher.hash(nlp.to_bytes())
```
I do not quite understand what you mean. As far as I can tell, using `to_bytes` does a pickle dump behind the scenes (with `srsly`), recursively using `to_bytes` on the required objects. Therefore, the result of `to_bytes` is a deterministic pickle dump AFAICT. Or do you mean that you wish that using your own pickler and running `dumps(nlp)` should also be deterministic? I guess that would require `__setstate__` and `__getstate__` methods on all the objects that have to/from_bytes. I'll ask over at spaCy what they think, and whether that would solve the issue. I'll try this locally first, if I find the time.
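For reference, a minimal sketch of that idea (purely illustrative, not actual spaCy code; `MyComponent` is a hypothetical class that already knows how to serialize itself):

```python
class MyComponent:
    """Hypothetical component with to_bytes/from_bytes, made deterministically picklable."""

    def __init__(self, data: bytes = b""):
        self._data = data

    def to_bytes(self) -> bytes:
        return self._data

    def from_bytes(self, data: bytes) -> "MyComponent":
        self._data = data
        return self

    def __getstate__(self):
        # Only the serialized bytes end up in the pickle, so the dump does not
        # depend on run-specific attributes such as object ids.
        return {"bytes_data": self.to_bytes()}

    def __setstate__(self, state):
        self.__init__()
        self.from_bytes(state["bytes_data"])
```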
I agree that having the option to use a custom hasher would be useful. I like your suggestion!
EDIT: after trying some things and reading through their API, it seems that they explicitly do not want this: https://spacy.io/usage/saving-loading#pipeline

> When serializing the pipeline, keep in mind that this will only save out the binary data for the individual components to allow spaCy to restore them – not the entire objects. This is a good thing, because it makes serialization safe. But it also means that you have to take care of storing the config, which contains the pipeline configuration and all the relevant settings.
The best way forward therefore seems to be to implement the ability to specify a hasher depending on the objects that are pickled, as you suggested. I can work on this if that is useful. I could use some pointers as to how you would like to implement the `register_hash` functionality though. I assume using `catalogue` over at Explosion might be a good starting point.
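In case it helps the discussion, here is a rough, dependency-free sketch (not the eventual `datasets` API; all names here are made up) of what a type-based hasher registry could look like:

```python
from typing import Callable, Dict, Optional, Type


class HasherRegistry:
    """Maps a type to a function that produces a deterministic hash for its instances."""

    def __init__(self) -> None:
        self._hashers: Dict[Type, Callable] = {}

    def register(self, cls: Type) -> Callable:
        def decorator(func: Callable) -> Callable:
            self._hashers[cls] = func
            return func
        return decorator

    def get(self, obj) -> Optional[Callable]:
        # First registered match wins; isinstance also covers subclasses.
        for cls, func in self._hashers.items():
            if isinstance(obj, cls):
                return func
        return None


hashers = HasherRegistry()

# Usage would then mirror the decorator proposed above:
# @hashers.register(spacy.language.Language)
# def hash_spacy_language(nlp):
#     return Hasher.hash(nlp.to_bytes())
```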
Interestingly, my PR does not solve the issue discussed above. The `tokenize` function hash is different on every run, because for some reason `nlp.__call__` has a different hash every time. The issue therefore seems to run much deeper than I thought. If you have any ideas, I'm all ears.
```bash
git clone https://github.com/explosion/spaCy.git
cd spaCy/
git checkout cab9209c3dfcd1b75dfe5657f10e52c4d847a3cf
cd ..
git clone https://github.com/BramVanroy/datasets.git
cd datasets
git checkout registry
pip install -e .
pip install ../spaCy
spacy download en_core_web_sm
```
```python
import spacy
from datasets import load_dataset
from datasets.fingerprint import Hasher
from datasets.utils.registry import hashers


@hashers.register(spacy.Language)
def hash_spacy_language(nlp):
    return Hasher.hash(nlp.to_bytes())


def main():
    fin = r"your/large/file"
    nlp = spacy.load("en_core_web_sm")
    # This is now always the same yay!
    print(Hasher.hash(nlp))

    def tokenize(l):
        return {"tok": [t.text for t in nlp(l["text"])]}

    ds = load_dataset("text", data_files=fin)
    # But this is not...
    print(Hasher.hash(tokenize))
    # ... because of this
    print(Hasher.hash(nlp.__call__))
    ds = ds["train"].map(tokenize)
    print(ds[0:2])


if __name__ == '__main__':
    main()
```
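My reading of what happens there (a guess from the pickling machinery, not a confirmed diagnosis): a bound method drags its instance along when pickled, so hashing `nlp.__call__` re-pickles the whole `nlp` object through the default path rather than the hasher registered for `Language` at the top level.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# nlp.__call__ is a bound method: pickling it means pickling (function, instance),
# so the run-specific state inside nlp still ends up in the dump.
print(nlp.__call__.__self__ is nlp)  # True
```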
Hi ! I just answered in your PR :) In order for your custom hashing to be used for nested objects, you must integrate it into our recursive pickler that we use for hashing.
I don't quite understand the design constraints of `datasets` or the script that you're running, but my usual advice is to avoid using pickle unless you absolutely have to. So for instance instead of doing your `partial` over the `nlp` object itself, can you just pass the string `en_core_web_sm` in? This will mean calling `spacy.load()` inside the work function, but this is no worse than having to call `pickle.load()` on the contents of the NLP object anyway -- in fact you'll generally find `spacy.load()` faster, apart from the disk read.
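A minimal sketch of that suggestion, assuming loading inside the mapped function is acceptable (`tokenize_batch` and the file path are made-up names; reloading per batch trades speed for a trivially hashable function):

```python
import spacy
from datasets import load_dataset


def tokenize_batch(batch, model_name="en_core_web_sm"):
    # Load from the model name inside the worker: only the string needs to be
    # hashed/pickled, not the live pipeline object.
    nlp = spacy.load(model_name, disable=["tagger", "parser", "ner", "lemmatizer"])
    return {"tok": [[t.text for t in doc] for doc in nlp.pipe(batch["text"])]}


ds = load_dataset("text", data_files="some/file/with/many/lines")["train"]
ds = ds.map(tokenize_batch, batched=True)
```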
If you need to pass in the bytes data and don't want to read from disk, you could do something like this:

```python
msg = (nlp.lang, nlp.to_bytes())

def unpack(lang, bytes_data):
    return spacy.blank(lang).from_bytes(bytes_data)
```
I think that should probably work: the Thinc `model.to_dict()` method (which is used by the `model.to_bytes()` method) doesn't pack the model's ID into the message, so the `nlp.to_bytes()` that you get shouldn't be affected by the global IDs. So you should get a clean message from `nlp.to_bytes()` that doesn't depend on the global state.
Hi Matthew, thanks for chiming in! We are currently implementing exactly what you suggest: `to_bytes()` as a default before pickling - but we may prefer `to_dict` to avoid double dumping.

`datasets` uses pickle dumps (actually dill) to get unique representations of processing steps (a "fingerprint" or hash). So it never needs to re-load that dump - it just needs its value to create a hash. If a fingerprint is identical to a cached fingerprint, then the result can be retrieved from the on-disk cache. (@lhoestq or @mariosasko can correct me if I'm wrong.)

I was experiencing the issue that parsing with spaCy gave me a different fingerprint on every run of the script, and thus it could never load the processed dataset from cache. At first I thought the reason was that spaCy Language objects were not picklable with recursive dill, but even after adjusting for that the issue persisted. @lhoestq found that this is due to the changing `id`, which you discussed here. So yes, you are right. On the surface there simply seems to be an incompatibility between `datasets`' default caching functionality as it is currently implemented and `spacy.Language`.
The linked PR aims to remedy that, though. Up to now I have put some effort into making it easier to define your own "pickling" function for a given type (and optionally any of its subclasses). That allows us to tell `datasets` to use `dill.save(nlp.to_bytes())` (deterministic) instead of `dill.save(nlp)` (non-deterministic). When I find some more time, the PR will be expanded to improve the user experience a bit and add a built-in function to pickle `spacy.Language` as one of the defaults (using `to_bytes()`).
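To illustrate the general "serialize the bytes, not the object graph" idea with standard-library machinery only (this is not the PR's code, and whether the `datasets` pickler consults `copyreg` is a separate question; the reducer simply reuses the `lang`/`to_bytes`/`from_bytes` round-trip suggested above):

```python
import copyreg

import spacy


def rebuild_language(lang: str, data: bytes):
    # Reconstruct a pipeline from its language code and serialized bytes.
    return spacy.blank(lang).from_bytes(data)


def reduce_language(nlp):
    # Pickle the (lang, bytes) pair instead of the live object graph,
    # so the dump no longer depends on run-specific ids.
    return (rebuild_language, (nlp.lang, nlp.to_bytes()))


copyreg.pickle(spacy.language.Language, reduce_language)
```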
Is there a workaround for this? Maybe by explicitly requesting `datasets` to cache the result of `.map()`?
Hi ! If your function is not picklable, then the fingerprint of the resulting dataset can't be computed. The fingerprint is a hash that is used by the cache to reload previously computed datasets: the dataset file is named `cache-<fingerprint>.arrow` in your dataset's cache directory.

As a workaround you can set the fingerprint that is going to be used by the cache:

```python
result = my_dataset.map(func, new_fingerprint=new_fingerprint)
```

Any future call to `map` with the same `new_fingerprint` will reload the result from the cache.

Be careful using this though: if you change your `func`, be sure to change the `new_fingerprint` as well.
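For the spaCy case discussed above, one way to make that fingerprint deterministic yourself is to derive it from something that is identical on every run. A sketch, assuming `nlp` and `tok_func` from the earlier snippets and relying on the observation that `nlp.to_bytes()` is stable across runs ("tokenize-v1" is an arbitrary label you bump when the function changes):

```python
from datasets.fingerprint import Hasher

fingerprint = Hasher.hash(("tokenize-v1", nlp.to_bytes()))  # deterministic across runs
ds = ds.map(tok_func, new_fingerprint=fingerprint)
```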
I've been having an issue that might be related to this when trying to pre-tokenize a corpus and caching it for use later in the pre-training of a RoBERTa model. I always get the following warning:

```
Dataset text downloaded and prepared to /gpfswork/rech/project/user/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data.

Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform [email protected] couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```

And when I launch the pre-training, the pre-tokenized corpus is not found and it is tokenized again, which makes me waste precious GPU hours.
For me, the workaround was downgrading `dill` and `multiprocess` to the following versions:

```
dill 0.3.4
multiprocess 0.70.12.2
```
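If it helps, a quick way to confirm which versions are actually active in your environment (both packages expose `__version__`):

```python
import dill
import multiprocess

print(dill.__version__)          # expected: 0.3.4
print(multiprocess.__version__)  # expected: 0.70.12.2
```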
> Hi ! If your function is not picklable, then the fingerprint of the resulting dataset can't be computed. The fingerprint is a hash that is used by the cache to reload previously computed datasets: the dataset file is named `cache-<fingerprint>.arrow` in your dataset's cache directory.
>
> As a workaround you can set the fingerprint that is going to be used by the cache:
>
> ```python
> result = my_dataset.map(func, new_fingerprint=new_fingerprint)
> ```
>
> Any future call to `map` with the same `new_fingerprint` will reload the result from the cache.
>
> Be careful using this though: if you change your `func`, be sure to change the `new_fingerprint` as well.
Is the argument `new_fingerprint` available for `DatasetDict`? I can only use it on Arrow datasets, but it might be useful to generalize it to `DatasetDict`'s `map` as well? @lhoestq
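In the meantime, a per-split workaround sketch (not a built-in `datasets` feature; it just applies the existing `Dataset.map` argument to each split, assuming `ds` is a `DatasetDict` and `tok_func`/`nlp` come from the earlier snippets):

```python
from datasets import DatasetDict
from datasets.fingerprint import Hasher

tokenized = DatasetDict({
    split: dset.map(tok_func, new_fingerprint=Hasher.hash((split, nlp.to_bytes())))
    for split, dset in ds.items()
})
```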
> I've been having an issue that might be related to this when trying to pre-tokenize a corpus and caching it for use later in the pre-training of a RoBERTa model. I always get the following warning:
>
> Dataset text downloaded and prepared to /gpfswork/rech/project/user/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data. Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform [email protected] couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
>
> And when I launch the pre-training the pre-tokenized corpus is not found and it is tokenized again, which makes me waste precious GPU hours.
>
> For me, the workaround was downgrading `dill` and `multiprocess` to the following versions:
>
> ```
> dill 0.3.4
> multiprocess 0.70.12.2
> ```
This worked for me - thanks!
I see this has just been closed - it seems quite relevant to another tokenizer I have been trying to use, the `vinai/phobert` family of tokenizers:

https://huggingface.co/vinai/phobert-base
https://huggingface.co/vinai/phobert-large
I ran into an issue where a large dataset took several hours to tokenize, the process hung, and I was unable to use the cached version of the tokenized data:
https://discuss.huggingface.co/t/cache-parallelize-long-tokenization-step/25791/3
I don't see any way to specify the hash of the tokenizer or the fingerprint of the tokenized data to use, so is the tokenized dataset basically lost at this point? Is there a good way to avoid this happening again if I retokenize the data?
In your case it looks like the job failed before caching the data - maybe one of the processes crashed
Interesting. Thanks for the observation. Any suggestions on how to start tracking that down? Perhaps run it single-threaded and see if it crashes?
You can monitor your RAM and disk space in case a process dies from OOM or a full disk, and when it hangs you can check how many processes are running. IIRC there are other start methods for multiprocessing in Python that may show an error message if a process dies.

Running on a single process can indeed also help with debugging this.
https://github.com/huggingface/datasets/issues/3178#issuecomment-1189435462
The solution does not work for the Common Voice dataset ("mozilla-foundation/common_voice_11_0").
Hi @tung-msol, could you open a new issue and share the error you got and the `map` function you used?