spaCy
Handle sentence boundaries from multiple components
Feature description
Decide how to handle is_sentenced and sentence boundaries that may come from multiple components (Sentencizer, SentenceRecognizer, Parser).
Some ideas:
- have an is_sentenced property more like is_parsed that can be set by components
- have a way to set finalized sentence boundaries (all 0 to -1):
  - have an extra option for each component
  - have an extra pipeline component (e.g., finalize_sentences?) that can be inserted at the right point in the pipeline
- also have a component that resets all sentence boundaries?
- modify Sentencizer to only set sentence starts, not all tokens?
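To make the "finalize" and "reset" ideas concrete, here is a minimal sketch on a toy model in plain Python, not spaCy's actual API. It assumes the integer encoding implied above (1 = sentence start, -1 = not a sentence start, 0 = unset); the function names finalize_sentences and reset_sentences are hypothetical.

```python
from typing import List


def finalize_sentences(sent_starts: List[int]) -> List[int]:
    """Return a copy in which every unset value (0) becomes -1.

    Assumed toy encoding: 1 = sentence start, -1 = not a start, 0 = unset.
    Values that were already decided (1 or -1) are left untouched.
    """
    return [-1 if value == 0 else value for value in sent_starts]


def reset_sentences(sent_starts: List[int]) -> List[int]:
    """Return a copy in which every value is unset (0) again."""
    return [0 for _ in sent_starts]


# Example: only the undecided positions change when finalizing.
values = [1, 0, 0, 1, -1, 0]
finalized = finalize_sentences(values)
cleared = reset_sentences(values)
```

In a real pipeline these would presumably be components that run over the Doc at a point chosen by the pipeline author.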
Check that no spaCy components clobber sentence boundaries and that is_sentenced works consistently when sentence boundaries come from multiple sources. If a component after the parser changes sentence boundaries, make sure the required tree recalculations are done (a related issue: #4497).
Potentially add warnings when non-zero sent_start is changed by any component?
I think the default behavior could be that any pipeline component can add sentence boundaries but that components won't remove any sentence boundaries. The idea would be that the Sentencizer or SentenceRecognizer add punctuation-based boundaries (typically high precision, although the Sentencizer less so) and the Parser can add phrase-based boundaries (improving recall). I don't know if this works as cleanly as envisioned in practice, especially with the Sentencizer. Most likely people using the Sentencizer aren't using other components so it's less of an issue, but I could imagine SentenceRecognizer + Parser as a common combination.
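A rough sketch of that "add but never remove" default, again on a toy model in plain Python rather than spaCy itself. The helper name add_boundaries and the list-of-ternary-values representation are assumptions for illustration only. It also shows one possible place for the warning mentioned above.

```python
import warnings
from typing import List, Optional

Ternary = Optional[bool]  # True = start, False = not a start, None = unset


def add_boundaries(existing: List[Ternary], proposed: List[Ternary]) -> List[Ternary]:
    """Merge a later component's proposals without removing earlier boundaries."""
    if len(existing) != len(proposed):
        raise ValueError("existing and proposed must have the same length")
    merged: List[Ternary] = []
    for i, (old, new) in enumerate(zip(existing, proposed)):
        if old is True:
            # An earlier boundary always survives; warn if it was contested.
            if new is False:
                warnings.warn(f"token {i}: ignoring attempt to remove a sentence boundary")
            merged.append(True)
        elif new is True:
            # A later component may add a boundary.
            merged.append(True)
        else:
            # No boundary from either side: keep the earlier value if it was set.
            merged.append(old if old is not None else new)
    return merged


# Example: a punctuation-based pass followed by a parser-like pass.
recognizer_output = [True, None, None, True, None]
parser_output = [True, False, True, False, False]
combined = add_boundaries(recognizer_output, parser_output)
```

Here the parser-like pass adds a boundary at token 2, and its attempt to remove the boundary at token 3 is ignored with a warning.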
Suggestions from @DomHudson in https://github.com/explosion/spaCy/issues/5050#issuecomment-590235869:
In my opinion the combination of {None, True, False} is not transparent or flexible enough to provide the information that it is currently trying to capture. It is likely to cause problems as it would be entirely reasonable to expect True to indicate a sentence boundary and False otherwise - a clean API should be self-explanatory.
I think the best approach is to have this attribute as a boolean (no None-types allowed) once the sentence boundaries have been set and None-type otherwise. If there is a desire to allow more complex stacking of pipelines and pipeline-units then a more complete history should be kept, for example a SentenceBoundaryBoolean object could be created which mimics True or False but also allows the state of certain tokens to be altered after their initial creation and retains the history of the model that caused the latest change. This would provide much more flexibility and explainability than the limited {True, False, None}.
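For reference, one way such an object might look. This is only a guess at the shape of the suggestion, not something proposed in that comment in this form: the fields value and history and the members set and last_source are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SentenceBoundaryBoolean:
    """Hypothetical boolean-like value that remembers which component set it."""

    value: Optional[bool] = None  # None until some component decides
    history: List[Tuple[str, bool]] = field(default_factory=list)

    def set(self, value: bool, source: str) -> None:
        """Record a new decision and the name of the component that made it."""
        self.value = value
        self.history.append((source, value))

    def __bool__(self) -> bool:
        # Behaves like False while unset, otherwise like the latest decision.
        return bool(self.value)

    @property
    def last_source(self) -> Optional[str]:
        """Name of the component behind the latest change, if any."""
        return self.history[-1][0] if self.history else None


# Example: two components disagree and the later one wins.
boundary = SentenceBoundaryBoolean()
boundary.set(True, source="sentencizer")
boundary.set(False, source="parser")
```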
Related conversations:
- Issue #5578 about DocBin serialization and which attributes to use
- Issue #5050 about the ternary system (true, false, none) for is_sent_start and how different components should set the values
- Issue #5287 about how the Matcher accesses the "ternary boolean" sentence boundary values
My position on this is mostly "keep it as is". I'm open to debate on this, but I'll explain my position.
I agree that an is_sentenced flag would be good. I'm happy to accommodate that.
I still think the ternary values are the most practical mechanism for allowing components to coordinate on the sentence boundaries. I don't think it would really help to have something like a decision history, and that would be impractical for efficiency reasons anyway.
There's ultimately no way for components to know what other components are expected to run before or after them. It's up to the pipeline author to construct a pipeline that behaves well as a whole. It's nice if components are configurable about how they set the sentence boundary values, but that's a question for the design of the individual components. And the pipeline author can always insert other processes that run over the Doc and set the boundaries differently.
I don't think any more complicated mechanism than ternary values would really help components coordinate. Let's say components got to set a single probability instead of a ternary. If you're writing a component and you receive some set probability, how should you interpret it? It will depend on how accurate you expect that model to be on your data, and how accurate you expect the component's own model to be. Only the person who puts together the pipeline is in a position to know how those values should be integrated, so it still can't happen automatically. Similarly, let's say you had a full history of which components had set the is_sent_start values, and what decisions they had made. If you had that, what would you do with it? You still don't know what the correct value should be.
So my position is that components are able to set three values for the is_sent_start attribute on each token: True, False, and None. Components should try to do a good job setting this value, and any component can choose to respect or ignore the previous decisions. Components will be more useful if they tell users what they do and allow that to be configured, but that's ultimately up to the component. And ultimately it's up to the pipeline author to construct a sequence of components that give useful results.
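As a rough illustration of what "respect or ignore" could look like as a per-component setting, here is a toy sketch in plain Python rather than spaCy's API. The names make_boundary_setter, overwrite, and after_period are hypothetical, and the period rule is just a stand-in for a real model.

```python
from typing import Callable, List, Optional

Ternary = Optional[bool]  # True = start, False = not a start, None = unset


def make_boundary_setter(
    predict: Callable[[List[str]], List[bool]],
    overwrite: bool = False,
) -> Callable[[List[str], List[Ternary]], List[Ternary]]:
    """Build a toy component that sets is_sent_start-like ternary values.

    With overwrite=False the component only fills in unset (None) positions;
    with overwrite=True it ignores whatever earlier components decided.
    """

    def component(words: List[str], sent_starts: List[Ternary]) -> List[Ternary]:
        predicted = predict(words)
        if overwrite:
            return list(predicted)
        return [pred if old is None else old for old, pred in zip(sent_starts, predicted)]

    return component


def after_period(words: List[str]) -> List[bool]:
    """Stand-in predictor: the first token and any token after '.' starts a sentence."""
    return [i == 0 or words[i - 1] == "." for i in range(len(words))]


words = ["Hi", ".", "Bye", "."]
earlier = [True, None, False, None]
respectful = make_boundary_setter(after_period)(words, earlier)
overriding = make_boundary_setter(after_period, overwrite=True)(words, earlier)
```

The respectful variant keeps the earlier False on "Bye", while the overriding variant replaces it with its own prediction. Which one is right depends on the pipeline, so the choice sits with the pipeline author.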
For ourselves as pipeline and component authors, I think the parser could be a bit more configurable. We could expose an option to never insert sentence boundaries, regardless of whether False or None were set. We can currently get that behaviour by setting all the is_sent_start values to False, but that overrides the previous values which might be undesirable. Personally I think this isn't that useful a configuration though, and I don't know what I'd call it.