NeMo-text-processing
Some bugs in English, German, Spanish, Italian normalizers
Hi!
I found a bug in English normalization when running the following code:
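from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="en",
    deterministic=True,
)
text = "Here is mail.nasa.gov."
norm_text = normalizer.normalize(text, punct_post_process=True)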
text=Here is mail.nasa.gov.
norm_text=Here is mail dot nasa dot gov dot
expected output=Here is mail dot nasa dot gov.
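Until this is fixed, one possible workaround is to detach the sentence-final period before normalizing and re-attach it afterwards. The sketch below is only an illustration: `split_final_period`, `normalize_keep_final_period`, and `fake_normalize` are hypothetical helpers, not part of NeMo-text-processing, and `fake_normalize` is a stand-in that merely mimics the reported output so the example runs without the library.

```python
from typing import Callable, Tuple


def split_final_period(text: str) -> Tuple[str, str]:
    """Split off one sentence-final period; leave ellipses ("...") untouched."""
    stripped = text.rstrip()
    if stripped.endswith(".") and not stripped.endswith(".."):
        return stripped[:-1], "."
    return stripped, ""


def normalize_keep_final_period(text: str, normalize: Callable[[str], str]) -> str:
    """Normalize everything except the final period, then re-attach the period."""
    body, period = split_final_period(text)
    return normalize(body).rstrip() + period


def fake_normalize(text: str) -> str:
    """Stand-in (assumption): verbalizes every period as "dot", like the reported bug."""
    return " ".join(text.replace(".", " dot ").split())


print(fake_normalize("Here is mail.nasa.gov."))
# Here is mail dot nasa dot gov dot
print(normalize_keep_final_period("Here is mail.nasa.gov.", fake_normalize))
# Here is mail dot nasa dot gov.
```

With the real normalizer, the callable would be something like `lambda s: normalizer.normalize(s, punct_post_process=True)`; I have not verified the result against the library. Note that removing the final period may change how a sentence-final abbreviation is normalized.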
A similar bug occurs in German normalization. The following code is applied:
normalizer = Normalizer(
    input_case="cased",
    lang="de",
)
norm_text = normalizer.normalize(text, punct_post_process=True)
text=Here is brettspielversand.de.
norm_text=Here is b r e t t s p i e l v e r s a n d punkt de punkt
expected output=Here is brettspielversand punkt de.
A similar problem occurs with text=KIM.com-Specials.
I get the same problem with website addresses in Spanish and Italian text.
I also found a specific bug in Spanish normalization. The following code is applied:
normalizer = Normalizer(
    input_case="cased",
    lang="es",
)
norm_text = normalizer.normalize(text, punct_post_process=True)
text=El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text=El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
I'm not sure what the expected output is, but the current norm_text looks wrong: it appears that "Li" was read as the Roman numeral LI (51) and verbalized as an ordinal.
I also encountered similar behavior:
text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen Sinnesbereichen.in allen Sinnen."
normalized_text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen S i n n e s b e r e i c h e n punkt in allen Sinnen."
The above is expected behavior. The normalizer assumes that consecutive sentences are separated by a period followed by at least one whitespace character. The string quoted above consists of two clauses separated by a period with no whitespace after it, so the period is not treated as a sentence boundary. Adding a space after the period produces the correct normalization.
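Since a period without trailing whitespace can be either a missing space or part of a website address, it may be safer to flag such periods for review than to fix them blindly. A minimal sketch, assuming nothing about the library itself (`find_unspaced_periods` and `add_space_after_period` are hypothetical helpers, not NeMo-text-processing functions):

```python
import re
from typing import List

# A period directly between two letters, e.g. "Sinnesbereichen.in" or "nasa.gov".
# [^\W\d_] matches any Unicode letter in Python 3 string patterns.
_UNSPACED_PERIOD = re.compile(r"(?<=[^\W\d_])\.(?=[^\W\d_])")


def find_unspaced_periods(text: str) -> List[int]:
    """Return the index of every period that sits directly between two letters."""
    return [m.start() for m in _UNSPACED_PERIOD.finditer(text)]


def add_space_after_period(text: str, index: int) -> str:
    """Insert one space after the period at `index` (after manual review)."""
    if text[index] != ".":
        raise ValueError("index does not point at a period")
    return text[: index + 1] + " " + text[index + 1 :]


text = (
    "Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und "
    "Pluralität in allen Sinnesbereichen.in allen Sinnen."
)
positions = find_unspaced_periods(text)
fixed = add_space_after_period(text, positions[0])
print(fixed)
```

The check cannot tell a sentence boundary from a domain name ("mail.nasa.gov" is flagged as well), so the decision to insert the space is left to the caller.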