`spaCy` is inconsistent when splitting sentences
Hello,

I am using spaCy to split text into sentences after joining a set of words with whitespace, but this process behaves unpredictably and inexplicably. I have a custom segmentation function where I set custom sentence boundaries (i.e. `is_sent_start`).
Custom function:

```python
import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    i = 0
    while i < len(doc) - 1:
        if doc[i].text.lower() in ["eq", "fig", "al", "table", "fig."]:
            # Abbreviations such as "fig." must not end a sentence.
            doc[i + 1].is_sent_start = False
            i += 1
        elif doc[i].text in ["(", "'s"]:
            # An opening bracket or possessive marker never starts a sentence.
            doc[i].is_sent_start = False
            i += 1
        elif doc[i].text in [".", ")."]:
            # A period (or a ")." token) ends the sentence, so the next token starts one.
            doc[i + 1].is_sent_start = True
            i += 1  # without this increment the loop never advances past a period
        else:
            doc[i + 1].is_sent_start = False
            i += 1
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("segm", before="parser")
nlp.pipeline
```
This is my `nlp.pipeline`:
```python
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x29f4c3ee0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x29f4c3f40>),
 ('segm', <function __main__.set_custom_segmentation(doc)>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x29f8380b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x29e3ee4c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x29dd1f100>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x29f7f7f40>)]
```
How to reproduce the behaviour
```python
doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 . ##(this is the sentence to consider)## We refer the reader to fig. 1 of Forbes et al. ( 2011 ) and fig. 10 of Faifer et al. ( 2011 ) for real-world examples of our schematic plot, which show not only the mean gradients but also the individual GC data points. Figure 2.")
for sent in doc.sents:
    print(sent)
```
The exact form of the tokens `Fig. 2 .` changes how the sentences are split in the current output. Please see the following examples:
- `Massive ETGs are summarized in a schematic way in Fig. 21 .` (changed `2 .` to `21 .`)
- `Massive ETGs are summarized in a schematic way in Fig. 21.` (removed the space between `21` and the period)
- `Massive ETGs are summarized in a schematic way in Fig. 2.` (removed the space between `2` and the period)
- `Massive ETGs are summarized in a schematic way in Fig. 1 .` (changed `2` to `1`)
- `Massive ETGs are summarized in a schematic way in Fig. 3 .` (changed `2` to `3`)
- `Massive ETGs are summarized in a schematic way in Fig. 4 .` (changed `2` to `4`)
- `Massive ETGs are summarized in a schematic way in Fig. 4.` (changed `2` to `4` and removed the whitespace)
- `Massive ETGs are summarized in a schematic way in Fig. 200.` (changed `2` to `200` and removed the space)
- `Massive ETGs are summarized in a schematic way in Fig. 200 .` (changed `2` to `200`)
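For reference, here is a small harness (my own sketch, not part of the original report; it assumes the `nlp` pipeline built above) that runs each variant and prints the resulting sentence splits:

```python
variants = [
    "Massive ETGs are summarized in a schematic way in Fig. 2 .",
    "Massive ETGs are summarized in a schematic way in Fig. 21 .",
    "Massive ETGs are summarized in a schematic way in Fig. 21.",
    "Massive ETGs are summarized in a schematic way in Fig. 2.",
    "Massive ETGs are summarized in a schematic way in Fig. 1 .",
    "Massive ETGs are summarized in a schematic way in Fig. 3 .",
    "Massive ETGs are summarized in a schematic way in Fig. 4 .",
    "Massive ETGs are summarized in a schematic way in Fig. 4.",
    "Massive ETGs are summarized in a schematic way in Fig. 200.",
    "Massive ETGs are summarized in a schematic way in Fig. 200 .",
]

for text in variants:
    doc = nlp(text)
    # Show how the pipeline splits each variant into sentences.
    print(f"{text!r} -> {[sent.text for sent in doc.sents]}")
```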
The way sentence boundaries are decided here is inconsistent. I have other examples as well, so I can share them if needed. Any help in understanding this would be appreciated.
Your Environment
- spaCy version: 3.6.0
- Platform: macOS-14.3.1-arm64-arm-64bit
- Python version: 3.10.12
- Pipelines: en_core_web_lg (3.6.0), en_core_web_sm (3.6.0)
One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines:
https://spacy.io/api/dependencyparser#assigned-attributes
If you have your own pipe that sets boundaries, you may want to run this pipe after the dependency parser for this reason. Could you try to see if this improves things for you?
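A minimal sketch of that suggestion (assuming the same `segm` component defined above) would be to register the component after the parser instead of before it:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Run the custom component after the parser, so it is applied
# once the parser has already assigned sentence boundaries.
nlp.add_pipe("segm", after="parser")
```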
Hello @danieldk,

Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the `parser` in `nlp.pipeline`, but I am facing an error:

```
ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.
```

I don't think this will work, as the parsing interferes with the custom segmentation boundaries that I need to set for certain edge cases such as `Fig.`, `eg.`, etc.
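One possible workaround (my own sketch, not something suggested in this thread): since the `segm` component above assigns a boundary decision to nearly every token anyway, the conflict can be avoided by excluding the parser entirely, at the cost of losing dependency parses:

```python
import spacy

# Exclude the parser so nothing overwrites the custom boundaries;
# note this also means the doc will have no dependency parse.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.add_pipe("segm")

doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 .")
# doc.sents now reflects only the boundaries set by the custom component.
print([sent.text for sent in doc.sents])
```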
I saw a similar issue here: #3569.
I have a slightly different, but similar issue (spaCy 3.7.4, macOS):
```python
In [69]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one").sents))
Out[69]: 1  # <-- WRONG

In [70]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one.").sents))
Out[70]: 3

In [71]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one").sents))
Out[71]: 3

In [72]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one.").sents))
Out[72]: 3
```
> I have a slightly different, but similar issue
This is a different question. Could you open a topic on the discussion forum?
> Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the `parser` in `nlp.pipeline`, but I am facing an error.
Ah right, sorry, I overlooked that. The issue with changing the boundaries after parsing is that it could result in dependency relations that cross sentence boundaries, which is one of the reasons why we disallow this. We'll have to look into this more deeply, because the parser should in principle respect boundaries that were set earlier. Also see
https://github.com/explosion/spaCy/discussions/11107 and https://github.com/explosion/spaCy/issues/7716 for more background.
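To illustrate the invariant being protected here (a sketch of my own, not from the thread): after parsing, every token's syntactic head lies within that token's sentence, and rewriting `is_sent_start` afterwards could leave a dependency arc crossing a sentence boundary, which is what error E043 guards against.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 .")

# Every token's head stays inside the token's own sentence.
# Moving a boundary after parsing could break this invariant.
for sent in doc.sents:
    for token in sent:
        assert sent.start <= token.head.i < sent.end
```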