Segmenter removes spaces between English words in code-mixed sentences

Open shivanraptor opened this issue 10 months ago • 4 comments

Describe the bug: The segmenter removes the spaces between English words in a code-mixed sentence. For example, take this sentence:

這是Career Centre

To reproduce: Here is the code:

import pycantonese
from pycantonese.word_segmentation import Segmenter
segmenter = Segmenter()
pyseg = pycantonese.segment("這是Career Centre", cls=segmenter)
for word in pyseg:
    print(word)

The output is:

這是
CareerCentre

Expected behavior: The expected output is:

這是
Career Centre

or

這是
Career
Centre

System (please complete the following information):

  • Operating System: macOS Sonoma 14.0 (23A344)
  • PyCantonese version: 3.4.0

shivanraptor avatar Oct 06 '23 09:10 shivanraptor

After digging through the old issues, I thought this had been fixed in https://github.com/jacksonllee/pycantonese/issues/32#issuecomment-1268983221, but it hasn't.

shivanraptor avatar Oct 09 '23 07:10 shivanraptor

That's mainly because https://github.com/jacksonllee/pycantonese/pull/35 is still unresolved, so an update has not been released yet.

laubonghaudoi avatar Oct 09 '23 20:10 laubonghaudoi

I guess I have to wait then.

shivanraptor avatar Oct 11 '23 02:10 shivanraptor

As a workaround, you can replace each space with an uncommon punctuation mark, such as "▁", before segmenting, and then skip that character when you read the output.

https://github.com/pengzhendong/g2p-mix/blob/dd19bee513cc13230c41ef66e479de695afa0e2c/g2p_mix/g2p_mix.py#L43

pengzhendong avatar May 11 '24 12:05 pengzhendong
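
The placeholder workaround suggested above could look roughly like the sketch below. It is an illustration under stated assumptions, not code from pycantonese or g2p-mix: `segment_keeping_spaces`, `PLACEHOLDER`, and `toy_segment` are hypothetical names, and `toy_segment` is only a stand-in that mimics the behaviour reported in this issue so the example runs without pycantonese installed. How the real segmenter treats "▁" (kept inside one token or emitted as its own token) has not been verified here, so the wrapper handles both cases.

```python
import re
from typing import Callable, List

# Assumption: "▁" (U+2581) never occurs in the text being segmented.
PLACEHOLDER = "\u2581"


def segment_keeping_spaces(
    text: str, segment: Callable[[str], List[str]]
) -> List[str]:
    """Protect spaces with a placeholder, segment, then restore them."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already present in the input")
    protected = text.replace(" ", PLACEHOLDER)
    words = []
    for token in segment(protected):
        # Turn placeholders back into spaces; trim any at the token edges.
        restored = token.replace(PLACEHOLDER, " ").strip()
        if restored:  # skip tokens that consisted only of placeholders
            words.append(restored)
    return words


def toy_segment(text: str) -> List[str]:
    """Hypothetical stand-in for a segmenter, NOT pycantonese itself.

    It drops spaces and splits Latin runs from everything else, which
    reproduces the symptom described in this issue.
    """
    text = text.replace(" ", "")
    return re.findall(r"[A-Za-z\u2581]+|[^A-Za-z\u2581]+", text)


print(toy_segment("這是Career Centre"))
print(segment_keeping_spaces("這是Career Centre", toy_segment))
```

With pycantonese installed, the intended call would be `segment_keeping_spaces("這是Career Centre", pycantonese.segment)`. Whether that yields "Career Centre" as one word or as two depends on how the segmenter tokenizes the placeholder, so check the output on your own data.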