spaCy
Suffix doesn't match for sentence ending in uppercase.
How to reproduce the behaviour
import spacy
nlp = spacy.load("en_core_web_sm")
list(nlp.tokenizer("about the P&L."))
I get
[about, the, P&L.]
The . should be separated from P&L here.
This behaviour comes from https://github.com/explosion/spaCy/blob/bf778f59c7ea48787ef4aac79ca2f1e33fe33e08/spacy/lang/punctuation.py#L33
The requirement for two uppercase letters before the period is likely there for acronyms, but perhaps an ampersand is acceptable as well,
e.g. r"(?<=&[{au}])\.".format(au=ALPHA_UPPER)
Your Environment
- spaCy version: 2.3.2
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.6.12
Yes, I see your point. I think you'd want to add the rule
r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER),
to the _suffixes. Unfortunately, you can't just put an optional & in the existing rule, because the look-behind can't be variable-width.
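To illustrate the point about look-behind width, here is a minimal sketch that uses only Python's re module. It assumes an ASCII-only [A-Z] in place of spaCy's ALPHA_UPPER and appends $ by hand to mimic a suffix match at the end of the string. splits_final_period is a hypothetical helper for this example, not part of spaCy.

```python
import re

# ASCII-only stand-in for spaCy's ALPHA_UPPER (an assumption made to keep
# this sketch dependency-free; the real class covers many more characters).
AU = "A-Z"

# Existing rule as discussed in this thread: a period after two uppercase letters.
acronym_rule = re.compile(r"(?<=[{au}][{au}])\.$".format(au=AU))
# Suggested extra rule: uppercase letter, ampersand, uppercase letter, period.
ampersand_rule = re.compile(r"(?<=[{au}]&[{au}])\.$".format(au=AU))


def splits_final_period(rule, text):
    """Return True if `rule` matches the trailing period of `text`."""
    return rule.search(text) is not None


print(splits_final_period(acronym_rule, "USA."))    # True
print(splits_final_period(acronym_rule, "P&L."))    # False
print(splits_final_period(ampersand_rule, "P&L."))  # True

# Folding an optional "&" into a single look-behind makes its width variable
# (2 or 3 characters), which the re module rejects at compile time.
try:
    re.compile(r"(?<=[{au}]&?[{au}])\.".format(au=AU))
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```

This is why the ampersand case needs its own fixed-width rule alongside the existing one rather than a tweak to it.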
If you're training a custom model, you could modify this behaviour in your own custom tokenizer; see https://spacy.io/usage/linguistic-features#native-tokenizer-additions. You could also replace the tokenizer of a pretrained model with your own custom tokenizer, although that may affect accuracy slightly (probably not much in this case).
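A rough sketch of what such a tokenizer addition could look like, following the pattern in the linked usage guide. It assumes a blank English pipeline is sufficient (only the tokenizer is involved), and the variable names are illustrative. Note that this changes the tokenizer of the loaded pipeline only, not the library defaults.

```python
import spacy
from spacy.lang.char_classes import ALPHA_UPPER
from spacy.util import compile_suffix_regex

# A blank English pipeline is enough here: only the tokenizer is involved.
nlp = spacy.blank("en")

# Extra suffix rule taken from the suggestion in this thread.
ampersand_rule = r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER)

# Defaults.suffixes is a tuple or a list depending on the spaCy version,
# so normalise it to a list before extending it.
suffixes = list(nlp.Defaults.suffixes) + [ampersand_rule]

# Rebuild the suffix regex and swap it into the existing tokenizer.
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [t.text for t in nlp.tokenizer("about the P&L.")]
print(tokens)
```

With the extra rule in place, the final period should come out as its own token.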
We're typically hesitant to change the punctuation rules in the core library though, because there may be unwanted side effects, especially when changing the lang/punctuation.py file that is used as the base for many other languages. On spaCy's develop branch, we have a specific punctuation file for English, https://github.com/explosion/spaCy/blob/develop/spacy/lang/en/punctuation.py, where we could consider adding this change for English only.
I've been trying to think of "bad" consequences of adding your proposed "ampersand" rule to the English tokenizer and can't immediately think of one. I'm less sure about other languages. Would be interested to hear what my colleagues think - e.g. @adrianeboyd ?
I can't think of anything major, but to be on the safe side we should test it with all the internal training corpora. Let me see...
I am experiencing a similar behavior with the German word "GmbH".
from spacy.lang.de import German
nlp = German()
[tok for tok in nlp("Herr Bert ist Geschäftsführer der Ernie GmbH.")]
Results in
[Herr, Bert, ist, Geschäftsführer, der, Ernie, GmbH.]
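One possible explanation, sketched with plain re: "GmbH." ends in a lowercase letter followed by a single uppercase letter, so a rule that needs two uppercase letters before the period does not fire. The sketch assumes ASCII-only stand-ins for spaCy's character classes, and it assumes the default rules also split a period after a lowercase letter. period_is_split is a hypothetical helper.

```python
import re

# ASCII-only stand-ins for spaCy's ALPHA_UPPER / ALPHA_LOWER (assumed for brevity).
UPPER, LOWER = "A-Z", "a-z"

# Simplified period rules: after two uppercase letters (from this thread),
# and after a lowercase letter (an assumption about the default rule set).
after_two_upper = re.compile(r"(?<=[{u}][{u}])\.$".format(u=UPPER))
after_lower = re.compile(r"(?<=[{l}])\.$".format(l=LOWER))


def period_is_split(word):
    """Return True if either simplified rule matches the trailing period."""
    return bool(after_two_upper.search(word) or after_lower.search(word))


for word in ["Firma.", "AG.", "GmbH."]:
    print(word, period_is_split(word))
# Firma. True
# AG. True
# GmbH. False
```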
I followed the example above and added a specific rule to _suffixes