Spaces impacting tag/pos
How to reproduce the behaviour
Note the double space before "sourire" in the first case vs. the single space in the second case. Both cases keep a double space before "et".
Les publics avec un  sourire chaleureux  et
https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20%20sourire%20chaleureux%20%20et&model=fr_core_news_sm
vs.
Les publics avec un sourire chaleureux  et
https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20sourire%20chaleureux%20%20et&model=fr_core_news_sm
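For anyone who prefers to reproduce this locally rather than through the demo, here is a minimal sketch. The two strings are taken from the displaCy URLs above. The helper names `load_pipeline` and `describe` are made up for this example, and the fallback to a blank French pipeline is only there so the script still runs when `fr_core_news_sm` is not installed (in that case `pos_` and `tag_` are simply empty).

```python
import spacy


def load_pipeline(name="fr_core_news_sm"):
    """Load the trained pipeline, or a tokenizer-only French pipeline as a fallback."""
    try:
        return spacy.load(name)
    except OSError:
        # Model not installed: tokenization is still visible, but pos_/tag_ stay empty.
        return spacy.blank("fr")


def describe(nlp, text):
    """Return (text, pos_, tag_, is_space) for every token in the parsed text."""
    return [(t.text, t.pos_, t.tag_, t.is_space) for t in nlp(text)]


# Texts decoded from the two displaCy URLs in this report.
DOUBLE = "Les publics avec un  sourire chaleureux  et"
SINGLE = "Les publics avec un sourire chaleureux  et"

if __name__ == "__main__":
    nlp = load_pipeline()
    for label, text in (("double space", DOUBLE), ("single space", SINGLE)):
        print(label)
        for row in describe(nlp, text):
            print("  ", row)
```

Comparing the rows for "sourire" and "chaleureux" between the two runs should show whether the extra whitespace token changes the predicted tag/pos.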
Your Environment
- Operating System:
- Python Version Used: 3.12
- spaCy Version Used: v3.5 (displaCy demo), also reproduced in v3.7
- Environment Information:
Semi-related: Is there any guidance on how to modify the tokenizer so that a double space is placed into whitespace_ (i.e. "  ") and does not lead to a SPACE token? I did take note of https://github.com/explosion/spaCy/issues/1707, though putting the additional spaces into whitespace_ seems more logical to me.
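To make the question concrete, here is a rough sketch of the kind of workaround I have in mind, not an official recipe. It wraps the existing tokenizer and rebuilds the Doc without the whitespace tokens. The class name `DropSpaceTokens` is hypothetical. As far as I can tell, `Doc` only accepts a boolean per token in `spaces`, so `whitespace_` can hold at most one space; this sketch therefore drops the extra spaces instead of preserving them, which means `doc.text` and character offsets no longer match the original input.

```python
import spacy
from spacy.tokens import Doc


class DropSpaceTokens:
    """Hypothetical wrapper: run the wrapped tokenizer, then discard whitespace tokens."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.vocab = tokenizer.vocab

    def __call__(self, text):
        doc = self.tokenizer(text)
        kept = [t for t in doc if not t.is_space]
        words = [t.text for t in kept]
        # One boolean per token: was it followed by a space in the original Doc?
        spaces = [bool(t.whitespace_) for t in kept]
        return Doc(self.vocab, words=words, spaces=spaces)


if __name__ == "__main__":
    # A blank pipeline keeps the sketch runnable without a trained model;
    # the same assignment should work on a loaded pipeline such as fr_core_news_sm.
    nlp = spacy.blank("fr")
    nlp.tokenizer = DropSpaceTokens(nlp.tokenizer)
    doc = nlp("Les publics avec un  sourire chaleureux  et")
    print([t.text for t in doc])
    print(repr(doc.text))
```

This wrapper does not implement serialization (`to_disk`, `from_disk`), so a pipeline using it could not be saved as-is. Ideally the default tokenizer would offer something equivalent that keeps the original text intact.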
Research
a) Maybe related: https://github.com/explosion/spaCy/issues/621
b) Semi-related: https://stephantul.github.io/spacy/2019/05/01/tokenizationspacy/
c) Semi-related: https://github.com/explosion/spaCy/discussions/9978
Maybe we could use infixes or suffixes?