Suffix doesn't match for sentence ending in uppercase.

Open jdupl123 opened this issue 3 years ago • 3 comments

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_sm")
list(nlp.tokenizer("about the P&L."))

I get

[about, the, P&L.]

The . should be separated from P&L here.

This behaviour comes from https://github.com/explosion/spaCy/blob/bf778f59c7ea48787ef4aac79ca2f1e33fe33e08/spacy/lang/punctuation.py#L33

The requirement for two uppercase letters before the period is likely there for acronyms, but perhaps an ampersand followed by an uppercase letter is acceptable too,

e.g. r"(?<=&[{au}])\.".format(au=ALPHA_UPPER)

Your Environment

  • spaCy version: 2.3.2
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.6.12

jdupl123 avatar Jan 08 '21 04:01 jdupl123

Yes, I see your point. I think you'd want to add the rule

r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER),

to the _suffixes. Unfortunately you can't just put an optional & in the existing rule, because the look-behind can't be variable-width.
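To illustrate the fixed-width point, here is a minimal sketch using only Python's built-in re module. The character class "A-Z" is an assumed ASCII stand-in for ALPHA_UPPER, and the two-uppercase pattern is my reading of the rule the linked line refers to, not a copy of it.

```python
import re

# Assumption: ASCII-only stand-in for spaCy's ALPHA_UPPER character class.
AU = "A-Z"

# Rule as described in this thread: a final period preceded by two uppercase letters.
two_upper = re.compile(r"(?<=[{au}][{au}])\.$".format(au=AU))
# Proposed extra rule: a final period preceded by uppercase, "&", uppercase.
amp_upper = re.compile(r"(?<=[{au}]&[{au}])\.$".format(au=AU))

print(bool(two_upper.search("the USA.")))  # matches: "SA" precedes the period
print(bool(two_upper.search("the P&L.")))  # no match: "&L" precedes the period
print(bool(amp_upper.search("the P&L.")))  # matches: "P&L" precedes the period

# Making "&" optional gives the look-behind two possible widths (2 or 3),
# which the re module refuses to compile.
try:
    re.compile(r"(?<=[{au}]&?[{au}])\.".format(au=AU))
    variable_width_rejected = False
except re.error as err:
    variable_width_rejected = True
    print("re.error:", err)
```

Because each look-behind has to have a single width, the ampersand case needs its own entry in the suffix list rather than a tweak to the existing one.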

If you're training a custom model, you could modify this behaviour in your own custom tokenizer, cf. https://spacy.io/usage/linguistic-features#native-tokenizer-additions. You could also replace the tokenizer of a pretrained model with your own custom tokenizer, although that may impact accuracy slightly (probably not by much in this case).
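A rough sketch of that customisation, following the pattern on the linked usage page: extend the default suffixes with the ampersand rule and recompile the suffix regex. It assumes a blank English pipeline (spacy.blank("en")) so that no model download is needed; I would expect a loaded model's tokenizer to be adjustable the same way, but I have not checked the effect on accuracy.

```python
import spacy
from spacy.lang.char_classes import ALPHA_UPPER
from spacy.util import compile_suffix_regex

# Assumption: a blank English pipeline is enough to show the tokenizer change.
nlp = spacy.blank("en")

# The rule suggested in this thread: split a period that follows
# uppercase letter, "&", uppercase letter.
amp_rule = r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER)

# Add the rule to the default suffixes and swap in the recompiled matcher.
suffixes = list(nlp.Defaults.suffixes) + [amp_rule]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [t.text for t in nlp.tokenizer("about the P&L.")]
print(tokens)
```

This only changes the tokenizer of this one nlp object; the shared lang/punctuation.py rules are left untouched.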

We're typically hesitant to change the punctuation rules in the core library though, because there may be unwanted side effects, especially when changing the lang/punctuation.py file that is used as the base for many other languages. On spaCy's develop branch, we have a specific punctuation file for English, https://github.com/explosion/spaCy/blob/develop/spacy/lang/en/punctuation.py, where we could consider adding this change for English only.

I've been trying to think of "bad" consequences of adding your proposed "ampersand" rule to the English tokenizer and can't immediately think of one. I'm less sure about other languages. Would be interested to hear what my colleagues think - e.g. @adrianeboyd ?

svlandeg avatar Jan 11 '21 19:01 svlandeg

I can't think of anything major, but to be on the safe side we should test it with all the internal training corpora. Let me see...

adrianeboyd avatar Jan 15 '21 14:01 adrianeboyd

I am experiencing a similar behavior with the German word "GmbH".

from spacy.lang.de import German
nlp = German()
[tok for tok in nlp("Herr Bert ist Geschäftsführer der Ernie GmbH.")]

Results in

[Herr, Bert, ist, Geschäftsführer, der, Ernie, GmbH.]

I followed the example above and added a specific rule to _suffixes

MucAlex avatar Jan 27 '21 13:01 MucAlex