
NER resizing error in combination with entity rulers

svlandeg opened this issue 2 years ago · 1 comment

There's most likely a bug in spaCy v2 around the NER resizing, entity rulers in the pipeline, and IO+(re)initialization.

Background in this discussion thread: https://github.com/explosion/spaCy/discussions/8864. The pipeline contains several entity rulers and NER models. At a certain point, when an entity ruler is deserialized from disk and calls nlp.pipe while adding its patterns internally, the NER tries to resize itself and gets stuck in an inconsistent state with incompatible network sizes:

(...)
  File "C:\Users\smisr\PycharmProjects\june_2021\virtualenv\lib\site-packages\spacy\util.py", line 690, in from_disk
    reader(path / key)
  File "C:\Users\smisr\PycharmProjects\june_2021\virtualenv\lib\site-packages\spacy\pipeline\entityruler.py", line 344, in <lambda>
    "patterns": lambda p: self.add_patterns(
  File "C:\Users\smisr\PycharmProjects\june_2021\virtualenv\lib\site-packages\spacy\pipeline\entityruler.py", line 226, in add_patterns
    for label, pattern, ent_id in zip(
  File "C:\Users\smisr\PycharmProjects\june_2021\virtualenv\lib\site-packages\spacy\language.py", line 829, in pipe
    for doc in docs:
(...)
  File "nn_parser.pyx", line 274, in spacy.syntax.nn_parser.Parser.predict
(...)
  File "_parser_model.pyx", line 269, in spacy.syntax._parser_model.ParserModel.resize_output
ValueError: could not broadcast input array from shape (1258,64) into shape (1226,64)

I'm not sure if this behaviour also occurs in spaCy v3. It feels like a niche case though, and for now there is a usable work-around (add all entity ruler labels to the NER up-front). Logging here for future reference in case we happen to stumble upon similar problems.

I don't think this is very high priority, but if we want to debug and fix this, we'll first need a reproducible code snippet.
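For reference, a minimal sketch of the work-around mentioned above (registering all entity ruler labels with the NER up-front, so the NER never has to resize its output layer mid-pipeline). This uses the spaCy v3 API with a made-up "PRODUCT" pattern for illustration; the original report is against v2, where the component names and setup differ:

```python
import spacy

# Blank pipeline with an entity ruler followed by an NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "spaCy"}])
ner = nlp.add_pipe("ner")

# Work-around: add every ruler label to the NER before training or
# deserialization, so resize_output is never triggered later.
for label in ruler.labels:
    ner.add_label(label)

print(ner.labels)
```

This only avoids the resize; it doesn't fix the underlying inconsistency when genuinely new labels appear later.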

svlandeg avatar Aug 10 '21 12:08 svlandeg

The problem has occurred again. The suggested work-around only worked when the labels used during retraining were the same as those added when the model was first trained. When I tried to retrain the model with a mixture of records containing old and new labels, I received the same error.

I have not been able to create a reproducible snippet, as this seems to happen only when many labels are added. In this retraining run, 11,000 records with around 300 new labels were used to retrain the model.

smisra1 avatar Oct 22 '21 13:10 smisra1