Spaces impacting tag/pos

Open lsmith77 opened this issue 1 year ago • 1 comment

How to reproduce the behaviour

Notice the double space in front of "sourire" in the first case vs. the single space in the second case:

Les publics avec un  sourire chaleureux  et

image

https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20%20sourire%20chaleureux%20%20et&model=fr_core_news_sm

vs.

Les publics avec un sourire chaleureux  et

image

https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20sourire%20chaleureux%20%20et&model=fr_core_news_sm

Your Environment

  • Operating System:
  • Python Version Used: 3.12
  • spaCy Version Used: v3.5 (displacy) but also in v3.7
  • Environment Information:

Semi-related: Is there any guidance on how to modify the tokenizer so that a double space is placed into whitespace_ (i.e. as two spaces) rather than producing a SPACE token? I did take note of https://github.com/explosion/spaCy/issues/1707, though putting the additional spaces into whitespace_ seems more logical to me.
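One possible workaround, sketched here with only the standard library and not taken from spaCy itself: collapse runs of spaces before the text reaches the pipeline, and keep a map from each character of the normalized text back to its position in the original. The helper name `collapse_spaces` and the offset-list design are assumptions for illustration. This does not change what ends up in whitespace_; it only avoids feeding the extra spaces to the tagger.

```python
def collapse_spaces(text):
    """Collapse each run of spaces to a single space.

    Hypothetical helper (not part of spaCy). Returns a tuple
    (normalized_text, offsets), where offsets[i] is the index in the
    original text of character i of the normalized text.
    """
    chars = []
    offsets = []
    prev_space = False
    for i, ch in enumerate(text):
        is_space = ch == " "
        if is_space and prev_space:
            # Second or later space in a run: drop it.
            continue
        chars.append(ch)
        offsets.append(i)
        prev_space = is_space
    return "".join(chars), offsets


# Example text from this issue (double space before "sourire" and "et").
original = "Les publics avec un  sourire chaleureux  et"
normalized, offsets = collapse_spaces(original)

# Map a position in the normalized text back to the original text.
start = normalized.index("sourire")
orig_start = offsets[start]
```

The normalized string could then be passed to the pipeline, and a character offset reported for a token (for example via token.idx) could be looked up in `offsets` to recover its position in the original input.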

Research

a) Maybe related: https://github.com/explosion/spaCy/issues/621
b) Semi-related: https://stephantul.github.io/spacy/2019/05/01/tokenizationspacy/
c) Semi-related: https://github.com/explosion/spaCy/discussions/9978

lsmith77 avatar Oct 28 '24 12:10 lsmith77

Maybe we could use infixes or suffixes?

smal8 avatar Nov 12 '24 04:11 smal8