NeMo-text-processing

Some bugs in English, German, Spanish, Italian normalizers

Open Oktai15 opened this issue 9 months ago • 1 comments

Hi!

I found a bug in English normalization. The following code is applied:

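from nemo_text_processing.text_normalization.normalize import Normalizer

text = "Here is mail.nasa.gov."
normalizer = Normalizer(
  input_case="cased",
  lang="en",
  deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)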

text=Here is mail.nasa.gov.
norm_text=Here is mail dot nasa dot gov dot
expected output=Here is mail dot nasa dot gov.

A similar bug occurs in German normalization. The following code is applied:

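from nemo_text_processing.text_normalization.normalize import Normalizer

text = "Here is brettspielversand.de."
normalizer = Normalizer(
  input_case="cased",
  lang="de",
)
norm_text = normalizer.normalize(text, punct_post_process=True)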

text=Here is brettspielversand.de.
norm_text=Here is b r e t t s p i e l v e r s a n d punkt de punkt
expected output=Here is brettspielversand punkt de.

A similar problem occurs with text=KIM.com-Specials.. I hit the same problem with website addresses in Spanish and Italian text.

I also found a specific bug in Spanish normalization. The following code is applied:

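from nemo_text_processing.text_normalization.normalize import Normalizer

text = "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico."
normalizer = Normalizer(
  input_case="cased",
  lang="es",
)
norm_text = normalizer.normalize(text, punct_post_process=True)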

text=El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text=El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.

I'm not sure what the expected output is, but the current norm_text looks wrong: the name "Li" appears to be read as the Roman numeral LI (51) and verbalized as the ordinal "quincuagésimo primero".

Oktai15, May 01 '24 21:05
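Until the trailing-period handling is fixed in the library, one possible stopgap for the website cases above is to peel the sentence-final punctuation off before normalizing and reattach it afterwards. The sketch below is only an illustration and has not been run against NeMo: `normalize_keeping_final_punct` is a hypothetical helper, not part of nemo_text_processing, and the normalizer is passed in as a plain callable so the idea can be tried with any backend. It assumes the final period really is sentence punctuation, so it will misbehave on sentences ending in an abbreviation such as "etc.".

```python
import re
from typing import Callable

# Matches a run of sentence-final punctuation at the very end of the string.
_TRAILING_PUNCT = re.compile(r"[.!?]+$")


def normalize_keeping_final_punct(text: str, normalize: Callable[[str], str]) -> str:
    """Hypothetical workaround: normalize the text without its final
    punctuation, then put that punctuation back unchanged."""
    stripped = text.rstrip()
    match = _TRAILING_PUNCT.search(stripped)
    if match is None:
        # Nothing to protect, normalize as usual.
        return normalize(stripped)
    body = stripped[: match.start()]
    punct = match.group(0)
    return normalize(body).rstrip() + punct


def toy_normalize(text: str) -> str:
    # Stand-in for a real normalizer (assumption for the demo only):
    # verbalizes every period, which mimics the reported behavior.
    return text.replace(".", " dot ")


result = normalize_keeping_final_punct("Here is mail.nasa.gov.", toy_normalize)
```

With the real library, the callable could be something like `lambda s: normalizer.normalize(s, punct_post_process=True)`, using the `normalizer` object from the snippets above.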

I also ran into similar behavior:

text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen Sinnesbereichen.in allen Sinnen."
normalized_text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen S i n n e s b e r e i c h e n punkt in allen Sinnen."

dmylzenova, May 08 '24 06:05

The above is expected behavior. The normalizer assumes that consecutive sentences are separated by a period followed by at least one whitespace character. The string quoted above contains two clauses separated by a period with no whitespace, so "Sinnesbereichen.in" is treated as a single token rather than as a sentence boundary. Adding a space after the period yields the correct normalization.

zoobereq, May 24 '24 21:05
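The whitespace requirement described in the last comment could be enforced in a preprocessing step before the text reaches the normalizer. The sketch below is a heuristic only: `add_space_after_glued_periods` and the `KNOWN_TLDS` allowlist are hypothetical names introduced for illustration, not part of nemo_text_processing. It inserts a space after a period that sits directly between two letters, unless the token ends in a suffix the caller has declared to be a domain ending. The ambiguity is real ("in" is itself a valid top-level domain), so the allowlist has to be chosen per dataset, and tokens such as hyphenated domain names (for example KIM.com-Specials) will still be split incorrectly.

```python
import re

# Hypothetical allowlist of domain endings that should be left untouched.
KNOWN_TLDS = frozenset({"com", "de", "gov", "org"})

# A period with a letter immediately before and after it.
_GLUED_PERIOD = re.compile(r"(?<=[^\W\d_])\.(?=[^\W\d_])")


def add_space_after_glued_periods(text: str, known_tlds=KNOWN_TLDS) -> str:
    """Insert a space after periods glued between two words, skipping
    tokens whose last dot-separated label is in the allowlist."""
    fixed = []
    for token in text.split(" "):
        # Ignore trailing sentence punctuation when inspecting the token.
        core = token.rstrip(".,;:!?")
        last_label = core.rsplit(".", 1)[-1].lower()
        if "." in core and last_label not in known_tlds:
            token = _GLUED_PERIOD.sub(". ", token)
        fixed.append(token)
    return " ".join(fixed)


fixed_text = add_space_after_glued_periods(
    "in allen Sinnesbereichen.in allen Sinnen."
)
```

Numbers such as 3.14 are left alone because the pattern only fires when both neighbors of the period are letters.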