spaCy icon indicating copy to clipboard operation
spaCy copied to clipboard

Parser doesn't respect preset sentence boundaries in some cases

Open polm opened this issue 3 years ago • 7 comments

This is for the issue found in https://github.com/explosion/spaCy/discussions/7564.

How to reproduce the behaviour

Given a sentence, set is_sent_start to False in some but not all of the tokens before it gets to the parser.

Example sentence:

According to the sections (1) and (2) of this page.

Example settings:

According True
to None
the None
sections None
( True
1 None
) None
and False
( False
2 None
) None
of False
this None
page None
. None

In this case the parser will contradict the False settings. However, if you move them one down to the next token, they will be respected. This seems like an off-by-one error in the parser somewhere.

Your Environment

  • Operating System: Linux
  • Python Version Used:
  • spaCy Version Used: 3.0.5
  • Environment Information:

polm avatar Apr 09 '21 09:04 polm

Minimal reproduction code for this issue:

import spacy
from spacy.language import Language


@Language.component("special_split")
def special_split(doc):
    # Note this code is not robust with handling indices
    for ii, tok in enumerate(doc):
        if ii == 0:
            tok.is_sent_start = True
        if tok.text == "(":
            tok.is_sent_start = True
        if ii + 1 < len(doc) and doc[ii+1].text == "2":
            tok.is_sent_start = False
        if tok.text == ")":
            doc[ii+1].is_sent_start = False
        print(tok, tok.is_sent_start, sep="\t")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("special_split", before="parser")

text = "According to the sections (1) and (2) of this page."

doc = nlp(text)
for sent in doc.sents:
    print(sent)

Wrong output, because it overrules the user is_sent_start:

According to the sections
(1) and
(2) of this page.

Desired output (other valid outputs possible):

According to the sections
(1) and (2) of this page.

However, under v3.1.1 the output is the desired output above, not the bad output. This could happen due to small changes in the model causing the model predictions to align with the user settings by happy accident. I'll see if I can reproduce the bug by modifying the sentence a bit.

polm avatar Aug 17 '21 10:08 polm

Using the 3.0 model with 3.1 the wrong sentence splits still show up so it looks like the "fix" with 3.1 is a happy accident. Also note these tests are all with the small English model. I have not been able to make other sentences that show this issue.

There aren't many places where the sent starts are set, so this isn't very specific information, but I was able to track down the problematic change to set_children_from_heads. It seems that when the parser is making sure that heads are in the right sentences it doesn't pay any attention to existing sentence boundaries. I am not sure where in the parser it makes sure that single sentences don't have multiple heads but I guess it's missing something there.

polm avatar Aug 18 '21 04:08 polm

I have a case where this happens for en_core_web_lg, this is spacy ver.3.1.3 and model ver.3.1.0:

from spacy.language import Language
import spacy

nlp = spacy.load('en_core_web_lg')
text = """
Early sociologists include Jim Henson, Sarah Smith, Albert Johnson, Nelson Mandela, Mahatma Ghandi, Samsonite Jones, and W. E. B. Du Bois. Pp. 10–111.
"""

print(nlp.component_names)

doc = nlp(text)

print(list(doc.sents))
print(len(list(doc.sents)))


@Language.component( '_custom_sbd' )
def _custom_sbd( doc ):
    for tok in doc:
        # tok.is_sent_start = True
        if 'W.' in tok.text:
            print(tok)
            tok.is_sent_start = False
    return doc


nlp.add_pipe('_custom_sbd', before='parser')
print(nlp.component_names)

doc = nlp(text)

print(list(doc.sents))
print(len(list(doc.sents)))

you can follow a debugger/trace into Language.call() and watch where the custom_sbd sets the sentence boundary to False, followed by parser setting it back to True.

Thanks for the report and debugging info!

polm avatar Oct 20 '21 04:10 polm

@polm no problem! If there's any more info I can provide that would be helpful, please let me know.

I'm also hitting this issue after updating spaCy from v2 to v3. Are there any workarounds until it is fixed in the parser? Unfortunately it's a major issue in our use case.

itssimon avatar Nov 08 '22 11:11 itssimon

Sorry this is a major issue for you. Just to check something, this is happening because you're using partial sentence annotations? If you have some examples we could check, especially of short sentences, it might be helpful, just in case it brings up a pattern we haven't seen before. It would also be helpful to know why you need this - for example, do you also have legal text with unusual formatting, like the original reporter, or is it something else?

As far as workarounds, a couple come to mind, though none are great.

  • can you set all sentence boundaries instead of some?
  • rather than setting boundaries before the parser runs, set them afterwards
  • try setting both before and after the parser runs, which might do a better job directing the parser

polm avatar Nov 09 '22 04:11 polm