
IndexError E040 when using senter

rutgerjv opened this issue 2 years ago • 3 comments

How to reproduce the behaviour

As suggested on https://spacy.io/models, the "parser" can be replaced with the "senter" as a more efficient way to detect sentence boundaries. I did so, but ran into an E040 error when printing the individual noun_chunks of the document (which does not happen when using the original parser).

Code to reproduce:

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe('parser')
nlp.enable_pipe('senter')
doc = nlp(text)  # `text` is the input document (an excerpt is given in a comment below)

for chunk in doc.noun_chunks:
    print(chunk)

Error message:

  File "spacy\tokens\token.pyx", line 609, in spacy.tokens.token.Token.left_edge.__get__
  File "spacy\tokens\doc.pyx", line 474, in spacy.tokens.doc.Doc.__getitem__
  File "spacy\tokens\token.pxd", line 23, in spacy.tokens.token.Token.cinit
IndexError: [E040] Attempt to access token at 10794, max length 10792.

Your Environment

  • spaCy version: 3.2.3
  • Platform: Windows-10-10.0.19042-SP0
  • Python version: 3.9.7
  • Pipelines: en_core_web_lg (3.2.0), en_core_web_md (3.2.0), en_core_web_sm (3.2.0), en_core_web_trf (3.2.0)

rutgerjv · May 13 '22

That does sound like a bug, but I can't reproduce it from the info above. When you disable the parser and then try to use noun_chunks, you should get this error:

ValueError: [E029] `noun_chunks` requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
https://spacy.io/usage/models
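
For illustration, here is a minimal sketch of that expected behavior (the example text is made up; with the parser disabled and no whitespace quirks, no dependency annotation is set at all):

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe('parser')
nlp.enable_pipe('senter')
doc = nlp("This is a sentence. This is another one.")

try:
    list(doc.noun_chunks)
except ValueError as err:
    print(err)  # [E029] `noun_chunks` requires the dependency parse ...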

Do you have a minimal example including the text that fails? Please format code and error messages using code blocks, by using three backticks on a separate line before and after the code: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks

adrianeboyd · May 16 '22

Here's the code to reproduce it, including an excerpt from the text (part of a section with scientific references) on which the error occurs:

import spacy

text = "016/j. foodr es. 2017. 07. 018 (2017).\n\n 27.  Aluko, R."
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe('parser')
nlp.enable_pipe('senter')
doc = nlp(text)

for chunk in doc.noun_chunks:
    print(chunk)

rutgerjv · May 16 '22

Ah, I assumed nothing had changed with noun chunks and tested with v3.3.0, which does work.

This is related to #10003 and has to do with rules in the v3.2.0 models that handle whitespace tokens.

The problem is the whitespace in the text: the v3.2.0 model (the model version, not the spaCy version) adds dependencies for the SPACE tokens, e.g. the label dep for \n\n, and then the noun chunks function is confused about whether the doc is parsed. This partial parse leads to the error.
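
To illustrate the partial parse, here is a sketch that prints the per-token annotation for the failing text (the exact labels depend on the model version):

import spacy

nlp = spacy.load("en_core_web_sm")  # with the v3.2.0 model package
nlp.disable_pipe('parser')
nlp.enable_pipe('senter')
doc = nlp("016/j. foodr es. 2017. 07. 018 (2017).\n\n 27.  Aluko, R.")

for token in doc:
    # under the v3.2.0 rules only the SPACE tokens get a dep set, so
    # doc.has_annotation("DEP") is True even though most tokens are unparsed
    print(repr(token.text), token.pos_, token.dep_)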

So upgrading to en_core_web_sm v3.3.0, where these rules have been improved, should fix this particular problem for this text. We should consider what to do about the noun chunks iterators, which fail badly here given partial parses.
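
Until then, a possible workaround is to require a complete dependency annotation before iterating, e.g. (a sketch using Doc.has_annotation, which with require_complete=True only treats the doc as parsed when every token has a dep):

# skip partially parsed docs, e.g. ones where only the SPACE tokens have a dep
if doc.has_annotation("DEP", require_complete=True):
    for chunk in doc.noun_chunks:
        print(chunk)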

adrianeboyd · May 16 '22