NeMo-text-processing Some bugs in de, es and fr

Open Oktai15 opened this issue 1 year ago • 2 comments

Hi!

I'm using the latest NeMo release, 1.1.0, and found the following bugs.

Bugs

German (de):

text: Here is brettspielversand.de.
norm_text: Here is b r e t t s p i e l v e r s a n d punkt de.
expected output: Here is brettspielversand punkt de.

text: Sinnesbereichen.in allen Sinnen.
norm_text: S i n n e s b e r e i c h e n punkt in allen Sinnen.
expected output: Sinnesbereichen punkt in allen Sinnen.

text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
norm_text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)

For German normalization, I use the following code:

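from nemo_text_processing.text_normalization.normalize import Normalizer

# Build a deterministic, case-preserving normalizer for German.
normalizer = Normalizer(
    input_case="cased",
    lang="de",
    deterministic=True,
)

# Example input: the first German case listed above.
text = "Here is brettspielversand.de."
norm_text = normalizer.normalize(text, punct_post_process=True)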

Spanish (es):

text: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text: El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
expected output: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico. (not sure)

For Spanish normalization, I use the following code:

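from nemo_text_processing.text_normalization.normalize import Normalizer

# Build a deterministic, case-preserving normalizer for Spanish.
normalizer = Normalizer(
    input_case="cased",
    lang="es",
    deterministic=True,
)

# Example input: the Spanish case listed above.
text = "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico."
norm_text = normalizer.normalize(text, punct_post_process=True)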

French (fr):

text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
norm_text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
expected output: Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures. (not sure)

For French normalization, I use the following code:

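from nemo_text_processing.text_normalization.normalize import Normalizer

# Build a deterministic, case-preserving normalizer for French.
normalizer = Normalizer(
    input_case="cased",
    lang="fr",
    deterministic=True,
)

# Example input: the French case listed above.
text = "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h."
norm_text = normalizer.normalize(text, punct_post_process=True)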
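
To reproduce all of the cases in one run, here is a minimal sketch of a regression check. The names `Case`, `CASES`, `find_mismatches` and `nemo_normalize` are hypothetical helpers introduced for illustration and are not part of NeMo-text-processing; the only NeMo calls are the `Normalizer(...)` constructor and `normalize(...)` with the same arguments as in the snippets above. The expected outputs marked "(not sure)" in this report are tentative, so a mismatch on those cases may not be a real bug.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Case(NamedTuple):
    lang: str      # language code passed to Normalizer
    text: str      # raw input
    expected: str  # output the reporter expects (some are tentative)


# Inputs and expected outputs copied from the report.
CASES: List[Case] = [
    Case("de", "Here is brettspielversand.de.",
         "Here is brettspielversand punkt de."),
    Case("de", "Sinnesbereichen.in allen Sinnen.",
         "Sinnesbereichen punkt in allen Sinnen."),
    # Tentative expectation ("not sure" in the report).
    Case("de",
         "Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.",
         "Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie."),
    # Tentative expectation: the name "Li" should be left unchanged.
    Case("es",
         "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.",
         "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico."),
    # Tentative expectation ("not sure" in the report).
    Case("fr",
         "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.",
         "Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures."),
]


def find_mismatches(
    normalize: Callable[[str, str], str], cases: List[Case]
) -> List[Tuple[Case, str]]:
    """Return (case, actual_output) for every case whose output differs from expected."""
    failures = []
    for case in cases:
        actual = normalize(case.lang, case.text)
        if actual != case.expected:
            failures.append((case, actual))
    return failures


# One Normalizer per language, built on first use (construction can be slow).
_NORMALIZERS: Dict[str, object] = {}


def nemo_normalize(lang: str, text: str) -> str:
    """Normalize with NeMo using the same settings as in this report."""
    # Imported lazily so the harness can be loaded without NeMo installed.
    from nemo_text_processing.text_normalization.normalize import Normalizer

    if lang not in _NORMALIZERS:
        _NORMALIZERS[lang] = Normalizer(
            input_case="cased", lang=lang, deterministic=True
        )
    return _NORMALIZERS[lang].normalize(text, punct_post_process=True)
```

With NeMo-text-processing installed, `find_mismatches(nemo_normalize, CASES)` returns the cases that still differ from the expected output, together with what the normalizer actually produced. Passing the normalizer as a callable keeps the comparison logic separate from NeMo, so it can also be run against a stub.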

Sep 10 '24 07:09 Oktai15
