`spaCy` is inconsistent when splitting sentences
Hello,

I am using spaCy to split text into sentences after joining a set of words with whitespace, but this process behaves unpredictably and inexplicably. I have a custom segmentation function where I set custom sentence boundaries (i.e. `is_sent_start`).
Custom function:

```python
import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    i = 0
    while i < len(doc) - 1:
        if doc[i].text.lower() in ["eq", "fig", "al", "table", "fig."]:
            # Abbreviations such as "fig." must not end a sentence.
            doc[i + 1].is_sent_start = False
            i += 1
        elif doc[i].text in ["(", "'s"]:
            # An opening bracket or possessive marker never starts a sentence.
            doc[i].is_sent_start = False
            i += 1
        elif doc[i].text in [".", ")."]:
            # A period (or a ")." token) ends the sentence, so the next token starts one.
            doc[i + 1].is_sent_start = True
            i += 1  # without this increment the loop never advances past a period
        else:
            doc[i + 1].is_sent_start = False
            i += 1
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("segm", before="parser")
nlp.pipeline
```
This is my `nlp.pipeline`:
```python
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x29f4c3ee0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x29f4c3f40>),
 ('segm', <function __main__.set_custom_segmentation(doc)>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x29f8380b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x29e3ee4c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x29dd1f100>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x29f7f7f40>)]
```
How to reproduce the behaviour
```python
doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 . ##(this is the sentence to consider)## We refer the reader to fig. 1 of Forbes et al. ( 2011 ) and fig. 10 of Faifer et al. ( 2011 ) for real-world examples of our schematic plot, which show not only the mean gradients but also the individual GC data points. Figure 2.")
for sent in doc.sents:
    print(sent)
```
The exact form of the tokens `Fig. 2 .` changes how the sentences are split in the current output. Please see the following examples:
- `Massive ETGs are summarized in a schematic way in Fig. 21 .` (changed `2 .` to `21 .`)
- `Massive ETGs are summarized in a schematic way in Fig. 21.` (removed the space between `21` and the period)
- `Massive ETGs are summarized in a schematic way in Fig. 2.` (removed the space between `2` and the period)
- `Massive ETGs are summarized in a schematic way in Fig. 1 .` (changed `2` to `1`)
- `Massive ETGs are summarized in a schematic way in Fig. 3 .` (changed `2` to `3`)
- `Massive ETGs are summarized in a schematic way in Fig. 4 .` (changed `2` to `4`)
- `Massive ETGs are summarized in a schematic way in Fig. 4.` (changed `2` to `4` and removed the whitespace)
- `Massive ETGs are summarized in a schematic way in Fig. 200.` (changed `2` to `200` and removed the space)
- `Massive ETGs are summarized in a schematic way in Fig. 200 .` (changed `2` to `200`)
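For reference, here is a small harness (my own sketch, not part of the original report; it assumes the `nlp` pipeline built above) that runs each variant and prints the resulting sentence splits:

```python
variants = [
    "Massive ETGs are summarized in a schematic way in Fig. 2 .",
    "Massive ETGs are summarized in a schematic way in Fig. 21 .",
    "Massive ETGs are summarized in a schematic way in Fig. 21.",
    "Massive ETGs are summarized in a schematic way in Fig. 2.",
    "Massive ETGs are summarized in a schematic way in Fig. 1 .",
    "Massive ETGs are summarized in a schematic way in Fig. 3 .",
    "Massive ETGs are summarized in a schematic way in Fig. 4 .",
    "Massive ETGs are summarized in a schematic way in Fig. 4.",
    "Massive ETGs are summarized in a schematic way in Fig. 200.",
    "Massive ETGs are summarized in a schematic way in Fig. 200 .",
]

for text in variants:
    doc = nlp(text)
    # Show how the pipeline splits each variant into sentences.
    print(f"{text!r} -> {[sent.text for sent in doc.sents]}")
```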
The way sentence boundaries are decided here is inconsistent. I have other examples as well, so I can share them if needed. Any help in understanding this would be appreciated.
Your Environment
- spaCy version: 3.6.0
- Platform: macOS-14.3.1-arm64-arm-64bit
- Python version: 3.10.12
- Pipelines: en_core_web_lg (3.6.0), en_core_web_sm (3.6.0)
One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines:
https://spacy.io/api/dependencyparser#assigned-attributes
If you have your own pipe that sets boundaries, you may want to run this pipe after the dependency parser for this reason. Could you try to see if this improves things for you?
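A minimal sketch of that suggestion (assuming the same `segm` component defined above) would be to register the component after the parser instead of before it:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Run the custom component after the parser, so it is applied
# once the parser has already assigned sentence boundaries.
nlp.add_pipe("segm", after="parser")
```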
Hello @danieldk,

Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the `parser` in `nlp.pipeline`, but I am facing an error:

```
ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.
```

I don't think this will work, as the parsing interferes with the custom segmentation boundaries that I need to set for certain edge cases such as `Fig.`, `eg.`, etc.
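One possible workaround (my own sketch, not something suggested in this thread): since the `segm` component above assigns a boundary decision to nearly every token anyway, the conflict can be avoided by excluding the parser entirely, at the cost of losing dependency parses:

```python
import spacy

# Exclude the parser so nothing overwrites the custom boundaries;
# note this also means the doc will have no dependency parse.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.add_pipe("segm")

doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 .")
# doc.sents now reflects only the boundaries set by the custom component.
print([sent.text for sent in doc.sents])
```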
I saw a similar issue here: #3569.
I have a slightly different, but similar issue (spaCy 3.7.4, macOS):
```python
In [69]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one").sents))
Out[69]: 1  # <-- WRONG

In [70]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one.").sents))
Out[70]: 3

In [71]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one").sents))
Out[71]: 3

In [72]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one.").sents))
Out[72]: 3
```
> I have a slightly different, but similar issue
This is a different question. Could you open a topic on the discussion forum?
> Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the `parser` in `nlp.pipeline`, but I am facing an error.
Ah right, sorry, I overlooked that. The issue with changing the boundaries after parsing is that it could result in dependency relations that cross sentence boundaries, which is one of the reasons why we disallow this. We'll have to look into this more deeply, because the parser should in principle respect boundaries that were set earlier. Also see
https://github.com/explosion/spaCy/discussions/11107 and https://github.com/explosion/spaCy/issues/7716 for more background.
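To illustrate the invariant being protected here (a sketch of my own, not from the thread): after parsing, every token's syntactic head lies within that token's sentence, and rewriting `is_sent_start` afterwards could leave a dependency arc crossing a sentence boundary, which is what error E043 guards against.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 .")

# Every token's head stays inside the token's own sentence.
# Moving a boundary after parsing could break this invariant.
for sent in doc.sents:
    for token in sent:
        assert sent.start <= token.head.i < sent.end
```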