
sentence segmentation handling of guillemets

Open · joprice opened this issue 3 months ago · 1 comment

Sentence segmentation doesn't seem to handle guillemets (« and »). When a text contains dialogue, I end up with very large sentences merged together. I see an old PR that added this for German (https://github.com/explosion/spaCy/pull/4237/files) and some handling in functions like is_quote, but perhaps the segmenter doesn't take this into account somehow.

How to reproduce the behaviour

Process a document that uses guillemets as quotation marks, such as

Léa dit : « Bonjour ! Je suis Léa. Et toi ? » Marc répond : « Salut ! Je suis Marc. » Léa demande : « Où es-tu ? »

and check the values of the is_sent_start flags.
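The steps above can be sketched as a short script. This is only an illustration, assuming the fr_core_news_md pipeline from the environment list below is the one used to process the text; the helper name sentence_start_tokens and the main() wrapper are made up for this example and are not part of spaCy.

```python
from typing import Iterable, List

# The sample dialogue from this report, using guillemets as quotation marks.
TEXT = (
    "Léa dit : « Bonjour ! Je suis Léa. Et toi ? » "
    "Marc répond : « Salut ! Je suis Marc. » "
    "Léa demande : « Où es-tu ? »"
)


def sentence_start_tokens(doc: Iterable) -> List[str]:
    """Return the text of every token whose is_sent_start flag is truthy.

    is_sent_start may be True, False or None, so a plain truthiness
    check is used instead of comparing against True.
    """
    return [token.text for token in doc if token.is_sent_start]


def main() -> None:
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    # Assumption: fr_core_news_md is installed and is the pipeline under test.
    nlp = spacy.load("fr_core_news_md")
    doc = nlp(TEXT)

    # Tokens that the pipeline marked as the start of a sentence.
    print(sentence_start_tokens(doc))

    # The resulting sentence spans, one per line.
    for sent in doc.sents:
        print(repr(sent.text))
```

Calling main() prints the tokens flagged as sentence starts followed by each sentence span, which makes it easy to see whether the quoted dialogue was split or merged.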

Your Environment

  • spaCy version: 3.8.7
  • Platform: macOS-26.0.1-arm64-arm-64bit
  • Python version: 3.12.11
  • Pipelines: uk_core_news_md (3.8.0), pl_core_news_md (3.8.0), ca_core_news_md (3.8.0), it_core_news_md (3.8.0), ko_core_news_md (3.8.0), da_core_news_md (3.8.0), el_core_news_md (3.8.0), fr_core_news_md (3.8.0), en_core_web_md (3.8.0), es_core_news_md (3.8.0), fr_core_news_sm (3.8.0), ja_core_news_md (3.8.0), de_core_news_md (3.8.0), nl_core_news_md (3.8.0), sv_core_news_md (3.8.0), ro_core_news_md (3.8.0), pt_core_news_md (3.8.0), zh_core_web_md (3.8.0), fi_core_news_md (3.8.0), ru_core_news_md (3.8.0), hu_core_news_md (3.8.0)

joprice, Oct 13 '25 21:10

I will work on this and create a fix

faizanhuda12, Nov 18 '25 22:11