spacy-universal-sentence-encoder
Problems installing models on SpaCy 2.3.2
I installed the models as per the directions in the README, but they are not showing up in spaCy:
✔ Loaded compatibility table
====================== Installed models (spaCy v2.3.2) ======================
ℹ spaCy installation:
/Users/ckm/.pyenv/versions/3.7.5/lib/python3.7/site-packages/spacy
TYPE NAME MODEL VERSION
package en-vectors-web-lg en_vectors_web_lg 2.3.0 ✔
package en-trf-xlnetbasecased-lg en_trf_xlnetbasecased_lg 2.3.0 ✔
package en-trf-robertabase-lg en_trf_robertabase_lg 2.3.0 ✔
package en-trf-distilbertbaseuncased-lg en_trf_distilbertbaseuncased_lg 2.3.0 ✔
package en-trf-bertbaseuncased-lg en_trf_bertbaseuncased_lg 2.3.0 ✔
package en-core-web-sm en_core_web_sm 2.3.1 ✔
package en-core-web-lg en_core_web_lg 2.3.1 ✔
I'm installing using:
pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.3.1/en_use_md-0.3.1.tar.gz#en_use_md-0.3.1
and
pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.3.1/en_use_lg-0.3.1.tar.gz#en_use_lg-0.3.1
both of which complete successfully.
But calling nlp = spacy.load('en_use_md') results in the following warning:
/Users/username/.pyenv/versions/3.7.5/lib/python3.7/site-packages/spacy/util.py:275: UserWarning: [W031] Model 'en_use_md' (0.3.1) requires spaCy v2.1,<2 and is incompatible with the current spaCy version (2.3.2). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
It then downloads what seems to be an updated version:
Downloaded https://tfhub.dev/google/universal-sentence-encoder/4, Total size: 987.47MB
But as soon as any data is loaded, I get the following error:
File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'TFHubWrapper' object
(traceback removed for shortness)
Not sure where to go from here...
Hi @ckmaresca,
For the models not showing up in the output of spacy validate: this is normal. Only the models provided directly by Explosion are shown, because the check is done against this file: https://github.com/explosion/spacy-models/blob/master/compatibility.json . This Universal Sentence Encoder integration is not provided by Explosion and therefore cannot appear in that compatibility matrix.
For the second issue, the user warning: model version 0.3.1 was built with spaCy 2.3.2 and is declared as compatible with any spaCy between 2.1 and 2.4 (see https://github.com/MartinoMensio/spacy-universal-sentence-encoder/blob/master/spacy_universal_sentence_encoder/meta/en_use_md.json#L5, which is also visible in the installation logs). Despite this, I see the warning too and will try to fix it, thanks for pointing it out. It's a warning you can safely ignore.
For the third problem, the serialisation: it was also reported in #6. Are you using the docs in a multi-process environment? Doc objects get serialised and deserialised to pass between processes, and spaCy uses msgpack (via srsly, whose Packer appears in your traceback) to do this, which only supports simple types, while I am using a custom object that wraps the TFHub model.
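To illustrate the constraint, the standard library's json module behaves the same way here as srsly's msgpack (this is a stand-in for illustration, not the code path spaCy actually runs): plain containers and primitives serialise, any custom object raises a TypeError.

```python
import json

class TFHubWrapper:
    """Stand-in for the custom wrapper stored on the Doc."""

# Plain types serialise fine
json.dumps({"text": "hello", "vector": [0.1, 0.2]})

# A custom object does not: the same failure mode as
# "TypeError: can not serialize 'TFHubWrapper' object" above
try:
    json.dumps({"wrapper": TFHubWrapper()})
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```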
At the moment, I am still working out how to make the extension attributes serialisable. If I don't put the TFHubWrapper object in the attributes, I need a way to retrieve it when the docs are deserialised. I am looking into a way to solve this issue (I would need to override the to_bytes and from_bytes of the Doc object).
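One possible direction (a sketch only; the getter and module-level singleton here are my illustration, not the package's actual code) is to register the extension with a getter instead of storing the wrapper on the Doc. Getters are re-evaluated lazily on access, so nothing un-picklable ever has to survive serialisation:

```python
class TFHubWrapper:
    """Stand-in for the real wrapper around the TF Hub model."""
    def embed(self, text):
        return [0.0] * 512  # placeholder, not a real embedding

_wrapper = None  # one instance per process

def get_wrapper(doc):
    """Getter suitable for Doc.set_extension("tfhub", getter=get_wrapper).

    Loaded lazily: after deserialisation in a new process, the first access
    rebuilds the wrapper instead of expecting it in the Doc's user_data.
    """
    global _wrapper
    if _wrapper is None:
        _wrapper = TFHubWrapper()
    return _wrapper

# With spaCy available you would register it as:
#   from spacy.tokens import Doc
#   Doc.set_extension("tfhub", getter=get_wrapper)
```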
My suggestion, if you can, would be to use multithreading instead of multiprocessing. Are you managing multiple processes with a multiprocessing.pool.Pool? If so, you can switch to multiprocessing.pool.ThreadPool to avoid inter-process communication.
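The switch can be sketched like this (nlp below is a placeholder callable standing in for the loaded en_use_md pipeline; the real pipeline works the same way, since ThreadPool has the same map() API as Pool):

```python
from multiprocessing.pool import ThreadPool

# Stand-in for `nlp = spacy.load('en_use_md')`; with ThreadPool the real
# pipeline is shared across threads instead of pickled across processes.
def nlp(text):
    return text.split()

texts = ["First document to embed.", "Second document to embed."]

# Workers are threads in this one process, so Doc objects (and the
# TFHubWrapper) are never serialised.
with ThreadPool(processes=4) as pool:
    docs = pool.map(nlp, texts)
```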
I understand this is not doable if you are using nlp.pipe to process batches of texts (with the n_process argument).
Otherwise, if you do need multiple processes, at the moment the only solution is to handle the encoding outside of spaCy, using the code snippet provided at https://tfhub.dev/google/universal-sentence-encoder/4
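That route looks roughly like this (requires tensorflow and tensorflow_hub installed; the import is guarded here so the sketch degrades gracefully without them, and hub.load downloads the ~1GB model on first use, as in the log above):

```python
try:
    import tensorflow_hub as hub

    # Embed texts directly with TF Hub, bypassing spaCy (and its
    # serialisation of Doc objects) entirely
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    embeddings = embed(["First document.", "Second document."])
    # `embeddings` is a (2, 512) tensor; use embeddings.numpy() for arrays
except ImportError:
    embeddings = None  # tensorflow_hub not installed
```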
If you have experience with spaCy maybe you could give some directions. Serialisation is a big issue I'm looking forward to solving.
Martino
Re: multi-process: not sure, I'm using whatever the default is (I don't know whether spaCy uses this natively or not), as I have not changed any of those settings and am not explicitly using multiprocessing. I'm also not using nlp.pipe. I'll try changing to ThreadPool if I can figure out how ;-)
Re: serialization: I'm no spaCy expert, but I also had issues with serialization (none of the methods in the spaCy docs worked for me), although in a different context. I need to serialize NLP Docs to a DB to avoid processing things in real time. Serializing spaCy Doc objects took me a long time to figure out, as traditional Python serialization doesn't work.
I finally figured it out thanks to a few pointers from others. To serialize, I use the following:
import codecs
import pickle

import spacy
from spacy.tokens import DocBin

nlp = spacy.load(your_model)
nlpObject = nlp("your_text_here")

# Pack the Doc into a DocBin, keeping the attributes and user data needed later
docBinTemplate = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
docBinTemplate.add(nlpObject)
docInBytes = docBinTemplate.to_bytes()

# Pickle the bytes and base64-encode them into a DB-safe string
serializedNLPobject = codecs.encode(pickle.dumps(docInBytes), "base64").decode()
To un-serialize:
# Fetch the base64 string from the DB (equal to serializedNLPobject above)
nlpString = rawDBresult.content_as[str]
# Base64-decode and unpickle back into DocBin bytes
nlpBinary = pickle.loads(codecs.decode(nlpString.encode(), 'base64'))
# Rebuild the DocBin and extract the Doc objects
docObject = DocBin().from_bytes(nlpBinary)
nlpObject = list(docObject.get_docs(nlp.vocab))  # a list containing the Doc from above
Couple of notes:
- My Python code might be crap
- Consider this pseudo-code, not production ready
- Designed and tested to serialize as JSON string to CouchBase, YMMV
- Don't know how this might work between languages using IPC
Original solutions from https://stackoverflow.com/questions/49618917/what-is-the-recommended-way-to-serialize-a-collection-of-spacy-docs and https://stackoverflow.com/questions/30469575/how-to-pickle-and-unpickle-to-portable-string-in-python-3