firefox-translations-training Handle soft hyphens with custom normalization tables

Open eu9ene opened this issue 3 years ago • 4 comments

Ulrich:

The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps.

Whether we actually need this still requires clarification.

Oct 30 '21 00:10 eu9ene

Our current setup uses "nmt_nfkc" for normalization, which is NFKC ("Compatibility Decomposition, followed by Canonical Composition") with some additional normalization around spaces. That looks like a good strategy for ensuring that equivalent Unicode sequences end up with a consistent representation.

The work here, which is also enumerated in #69, would be to canonicalize soft hyphens (U+00AD) as hyphen-minus (U+002D).

See also: https://github.com/browsermt/bergamot-translator/issues/337

This custom table can be built: https://github.com/google/sentencepiece/blob/master/doc/normalization.md#use-custom-normalization-rule-1

There is also an existing script here: https://gist.github.com/jelmervdl/712ba7a4ed663ce62d43e6f902a7254e#file-update-py
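As a rough sketch of what extending the table could look like (this is not the linked gist, and the helper names are made up for illustration): the SentencePiece normalization doc describes the TSV as one rule per line, with source and target codepoints written in hex, space-separated, and a tab between the two sides. Assuming that format, and assuming the base table contents (e.g. the nmt_nfkc rules) are already loaded as text, appending the soft hyphen rule might be:

```python
SOFT_HYPHEN = 0x00AD
HYPHEN_MINUS = 0x002D


def format_rule(source, target):
    """Format one rule line: hex codepoints, space-separated, tab between sides.

    An empty target list yields an empty right-hand side, which would be the
    "remove the character" variant instead of mapping it to something.
    """
    src = " ".join(f"{cp:X}" for cp in source)
    tgt = " ".join(f"{cp:X}" for cp in target)
    return f"{src}\t{tgt}"


def add_rule(base_tsv, source, target):
    """Return the table text with a rule for `source` added.

    Any existing rule with the same source sequence is dropped first, so the
    table does not end up with two conflicting entries.
    """
    src_key = " ".join(f"{cp:X}" for cp in source)
    kept = [
        line
        for line in base_tsv.splitlines()
        if line.split("\t")[0].strip().upper() != src_key
    ]
    kept.append(format_rule(source, target))
    return "\n".join(kept) + "\n"


# Tiny stand-in for the real base table, using the sample rule from the docs.
base = "41 302 300\t1EA6\n"
custom = add_rule(base, [SOFT_HYPHEN], [HYPHEN_MINUS])
```

Passing `[]` as the target instead of `[HYPHEN_MINUS]` would give the "remove soft hyphens" behavior from the original description, so the same helper covers both options until we decide which one we want.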

In the pipeline the spm_train call will need updating: https://github.com/search?q=repo%3Amozilla%2Ffirefox-translations-training%20spm_train&type=code
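For the spm_train side, the change is probably just passing the table through the `--normalization_rule_tsv` flag from the SentencePiece docs. A minimal sketch of building such a command line, where the function name, file paths, and vocab size are placeholders and not what the pipeline actually uses:

```python
def build_spm_train_args(corpus, model_prefix, vocab_size, rule_tsv):
    """Build an spm_train command line that uses a custom normalization table.

    All argument values are hypothetical; the real pipeline passes more flags.
    """
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # Custom table instead of relying on the built-in rule name.
        f"--normalization_rule_tsv={rule_tsv}",
    ]


args = build_spm_train_args("corpus.txt", "vocab", 32000, "custom_nmt_nfkc.tsv")
# The list could then be handed to subprocess.run(args, check=True).
```

As far as I understand it, a TSV passed this way is used as the whole normalization rule, not layered on top of "nmt_nfkc", which is why the custom table needs to start from the existing nmt_nfkc rules and add the soft hyphen entry to them.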

Apr 09 '24 21:04 gregtatum
