
Problems installing models on SpaCy 2.3.2

Open ckmaresca opened this issue 3 years ago • 2 comments

I installed the models as per the directions in the README, but they are not showing up in spaCy:

✔ Loaded compatibility table

====================== Installed models (spaCy v2.3.2) ======================
ℹ spaCy installation:
/Users/ckm/.pyenv/versions/3.7.5/lib/python3.7/site-packages/spacy

TYPE      NAME                              MODEL                             VERSION
package   en-vectors-web-lg                 en_vectors_web_lg                 2.3.0   ✔
package   en-trf-xlnetbasecased-lg          en_trf_xlnetbasecased_lg          2.3.0   ✔
package   en-trf-robertabase-lg             en_trf_robertabase_lg             2.3.0   ✔
package   en-trf-distilbertbaseuncased-lg   en_trf_distilbertbaseuncased_lg   2.3.0   ✔
package   en-trf-bertbaseuncased-lg         en_trf_bertbaseuncased_lg         2.3.0   ✔
package   en-core-web-sm                    en_core_web_sm                    2.3.1   ✔
package   en-core-web-lg                    en_core_web_lg                    2.3.1   ✔

I'm installing with:

pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.3.1/en_use_md-0.3.1.tar.gz#en_use_md-0.3.1
pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.3.1/en_use_lg-0.3.1.tar.gz#en_use_lg-0.3.1

Both complete successfully.

But calling nlp = spacy.load('en_use_md') results in the following error:

/Users/username/.pyenv/versions/3.7.5/lib/python3.7/site-packages/spacy/util.py:275: UserWarning: [W031] Model 'en_use_md' (0.3.1) requires spaCy >=2.1.0,<2.4.0 and is incompatible with the current spaCy version (2.3.2). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate

It then downloads what seems to be an updated version: Downloaded https://tfhub.dev/google/universal-sentence-encoder/4, Total size: 987.47MB

But as soon as any data is loaded, I get the following error:

File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'TFHubWrapper' object

(traceback removed for brevity)

Not sure where to go from here...

ckmaresca · Aug 12 '20 16:08

Hi @ckmaresca,

For the models not showing up in the output of spacy validate: this is normal. Only the models provided directly by Explosion are shown, because the check is done against https://github.com/explosion/spacy-models/blob/master/compatibility.json . This Universal Sentence Encoder integration is not provided by Explosion and therefore cannot appear in that compatibility matrix.
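As a quick sanity check instead of spacy validate, you can verify the pip package and load the model directly. A minimal sketch, assuming the en_use_md install went through:

import spacy

# Third-party models never show up in `spacy validate`, so check the installed
# package and try loading it instead.
print(spacy.util.is_package("en_use_md"))  # True if the pip install succeeded
nlp = spacy.load("en_use_md")
doc = nlp("Hello world")
print(doc.vector.shape)  # the Universal Sentence Encoder embedding attached to the Doc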

For the second issue, the user warning: model version 0.3.1 was created with spaCy 2.3.2 and is declared as compatible with any spaCy between 2.1 and 2.4 (see https://github.com/MartinoMensio/spacy-universal-sentence-encoder/blob/master/spacy_universal_sentence_encoder/meta/en_use_md.json#L5; this also appears in the installation logs). Despite this, I also see the warning and will try to fix it, thanks for pointing it out. It is a warning you can safely ignore.
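Until that is fixed, since the warning is harmless here, one option is to filter it out explicitly. A small sketch (this is not something the package does for you):

import warnings
import spacy

# Silence only the W031 compatibility warning; everything else still shows.
warnings.filterwarnings("ignore", message=r"\[W031\]")

nlp = spacy.load("en_use_md")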

For the third problem, the serialisation: it has also been pointed out in #6. Are you using the docs in a multi-process environment? Doc objects get serialised and deserialised to pass between processes, and spaCy uses srsly's msgpack Packer for this, which only supports simple types (while I am using a custom object that wraps the TFHub model). At the moment I have trouble making the extension attributes serialisable: if I don't put the TFHubWrapper object in the attributes, I need a way to retrieve it when the docs are deserialised. I am looking for a way to solve this (I would need to override the to_bytes and from_bytes of the Doc object).

My suggestion, if you can, is to use multi-threading instead of multi-processing. Are you managing multiple processes with a multiprocessing.pool.Pool? In that case, you can switch to multiprocessing.pool.ThreadPool to avoid inter-process communication; a sketch of that swap is below. I understand this is not doable if you are using nlp.pipe to process batches of texts with the n_process argument. Otherwise, if you really need multiple processes, at the moment the only solution is to handle the encoding outside of spaCy by using the code snippet provided at https://tfhub.dev/google/universal-sentence-encoder/4
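As a rough sketch of the Pool-to-ThreadPool swap (the process function and the texts below are just placeholders):

from multiprocessing.pool import ThreadPool  # instead of multiprocessing.pool.Pool

import spacy

nlp = spacy.load("en_use_md")
texts = ["first document", "second document"]  # placeholder inputs

def process(text):
    # Threads share one process, so the Doc (and its TFHubWrapper extension data)
    # never has to be pickled for inter-process transfer.
    return nlp(text).vector

with ThreadPool(4) as pool:
    vectors = pool.map(process, texts)

And if separate processes really are needed, the embedding can be computed outside spaCy in each worker, essentially the snippet from the TFHub page:

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(["first document", "second document"])  # one 512-dimensional vector per text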

If you have experience with spaCy maybe you could give some directions. Serialisation is a big issue I'm looking forward to solving.
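One possible direction, purely as a sketch and not the package's current implementation (the names get_wrapper and use_vector below are made up): keep the wrapper in a module-level singleton and expose the embedding through a getter extension, so nothing non-serialisable ever sits in the Doc's user_data.

import tensorflow_hub as hub
from spacy.tokens import Doc

_WRAPPER = None  # module-level singleton, recreated in whichever process needs it

def get_wrapper():
    global _WRAPPER
    if _WRAPPER is None:
        _WRAPPER = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    return _WRAPPER

def use_vector_getter(doc):
    # Computed on demand from the text, so the Doc itself stays serialisable.
    return get_wrapper()([doc.text]).numpy()[0]

Doc.set_extension("use_vector", getter=use_vector_getter, force=True)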

Martino

MartinoMensio · Aug 12 '20 17:08

Re: multi-process - not sure, I'm using whatever the default is (I don't know if spaCy uses this natively or not), since I haven't changed any of those settings and am not explicitly using multi-processing. Also not using nlp.pipe - will try changing to ThreadPool if I can figure out how to do it ;-)

Re: serialization - I'm no spaCy expert, but I also had issues with serialization (none of the methods in the spaCy docs worked for me), although in a different context. I need to serialize NLP Docs to a DB to avoid processing things in realtime. Serialization of spaCy Doc objects took me a long time to figure out, as traditional Python serialization doesn't work.

I finally figured it out thanks to a few pointers from others. To serialize, I use the following:

import codecs
import pickle
import spacy
from spacy.tokens import DocBin

nlp = spacy.load(your_model)  # your_model is the name of whichever model you use
nlpObject = nlp("your_text_here")
# DocBin keeps only the listed attributes; store_user_data=True also keeps doc.user_data
docBinTemplate = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
docBinTemplate.add(nlpObject)
docInBytes = docBinTemplate.to_bytes()
serializedNLPobject = codecs.encode(pickle.dumps(docInBytes), "base64").decode()

To un-serialize:

nlpString = rawDBresult.content_as[str]  # needs to be a str, equal to serializedNLPobject in the code above
nlpBinary = pickle.loads(codecs.decode(nlpString.encode(), 'base64'))
docObject = DocBin().from_bytes(nlpBinary)
nlpObjects = list(docObject.get_docs(nlp.vocab))  # a list of Docs; nlpObjects[0] corresponds to nlpObject above

Couple of notes:

  • My Python code might be crap
  • Consider this pseudo-code, not production ready
  • Designed and tested to serialize as a JSON string to Couchbase, YMMV
  • Don't know how this might work between languages using IPC

Original solutions from https://stackoverflow.com/questions/49618917/what-is-the-recommended-way-to-serialize-a-collection-of-spacy-docs and https://stackoverflow.com/questions/30469575/how-to-pickle-and-unpickle-to-portable-string-in-python-3

ckmaresca · Aug 12 '20 18:08