spaCy
spaCy copied to clipboard
Parser doesn't respect preset sentence boundaries in some cases
This is for the issue found in https://github.com/explosion/spaCy/discussions/7564.
How to reproduce the behaviour
Given a sentence, set is_sent_start
to False in some but not all of the tokens before it gets to the parser.
Example sentence:
According to the sections (1) and (2) of this page.
Example settings:
According True
to None
the None
sections None
( True
1 None
) None
and False
( False
2 None
) None
of False
this None
page None
. None
In this case the parser will contradict the False settings. However, if you move them one down to the next token, they will be respected. This seems like an off-by-one error in the parser somewhere.
Your Environment
- Operating System: Linux
- Python Version Used:
- spaCy Version Used: 3.0.5
- Environment Information:
Minimal reproduction code for this issue:
import spacy
from spacy.language import Language
@Language.component("special_split")
def special_split(doc):
# Note this code is not robust with handling indices
for ii, tok in enumerate(doc):
if ii == 0:
tok.is_sent_start = True
if tok.text == "(":
tok.is_sent_start = True
if ii + 1 < len(doc) and doc[ii+1].text == "2":
tok.is_sent_start = False
if tok.text == ")":
doc[ii+1].is_sent_start = False
print(tok, tok.is_sent_start, sep="\t")
return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("special_split", before="parser")
text = "According to the sections (1) and (2) of this page."
doc = nlp(text)
for sent in doc.sents:
print(sent)
Wrong output, because it overrules the user is_sent_start
:
According to the sections
(1) and
(2) of this page.
Desired output (other valid outputs possible):
According to the sections
(1) and (2) of this page.
However, under v3.1.1 the output is the desired output above, not the bad output. This could happen due to small changes in the model causing the model predictions to align with the user settings by happy accident. I'll see if I can reproduce the bug by modifying the sentence a bit.
Using the 3.0 model with 3.1 the wrong sentence splits still show up so it looks like the "fix" with 3.1 is a happy accident. Also note these tests are all with the small English model. I have not been able to make other sentences that show this issue.
There aren't many places where the sent starts are set, so this isn't very specific information, but I was able to track down the problematic change to set_children_from_heads
. It seems that when the parser is making sure that heads are in the right sentences it doesn't pay any attention to existing sentence boundaries. I am not sure where in the parser it makes sure that single sentences don't have multiple heads but I guess it's missing something there.
I have a case where this happens for en_core_web_lg
, this is spacy ver.3.1.3 and model ver.3.1.0:
from spacy.language import Language
import spacy
nlp = spacy.load('en_core_web_lg')
text = """
Early sociologists include Jim Henson, Sarah Smith, Albert Johnson, Nelson Mandela, Mahatma Ghandi, Samsonite Jones, and W. E. B. Du Bois. Pp. 10–111.
"""
print(nlp.component_names)
doc = nlp(text)
print(list(doc.sents))
print(len(list(doc.sents)))
@Language.component( '_custom_sbd' )
def _custom_sbd( doc ):
for tok in doc:
# tok.is_sent_start = True
if 'W.' in tok.text:
print(tok)
tok.is_sent_start = False
return doc
nlp.add_pipe('_custom_sbd', before='parser')
print(nlp.component_names)
doc = nlp(text)
print(list(doc.sents))
print(len(list(doc.sents)))
you can follow a debugger/trace into Language.call() and watch where the custom_sbd
sets the sentence boundary to False, followed by parser
setting it back to True.
Thanks for the report and debugging info!
@polm no problem! If there's any more info I can provide that would be helpful, please let me know.
I'm also hitting this issue after updating spaCy from v2 to v3. Are there any workarounds until it is fixed in the parser? Unfortunately it's a major issue in our use case.
Sorry this is a major issue for you. Just to check something, this is happening because you're using partial sentence annotations? If you have some examples we could check, especially of short sentences, it might be helpful, just in case it brings up a pattern we haven't seen before. It would also be helpful to know why you need this - for example, do you also have legal text with unusual formatting, like the original reporter, or is it something else?
As far as workarounds, a couple come to mind, though none are great.
- can you set all sentence boundaries instead of some?
- rather than setting boundaries before the parser runs, set them afterwards
- try setting both before and after the parser runs, which might do a better job directing the parser