firefox-translations-training
Handle soft hyphens with custom normalization tables
Ulrich:
The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps.
Whether we actually need this still requires clarification.
Our current setup uses "nmt_nfkc" for normalization, which is "Compatibility Decomposition, followed by Canonical Composition". That looks like a good normalization strategy for ensuring consistent byte representations of different Unicode codepoints.
The work here, which is enumerated in #69 as well, would be to handle soft hyphens (U+00AD) by canonicalizing them to the hyphen-minus (U+002D).
See also: https://github.com/browsermt/bergamot-translator/issues/337
This custom table can be built: https://github.com/google/sentencepiece/blob/master/doc/normalization.md#use-custom-normalization-rule-1
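As a rough illustration of what building such a table could look like, here is a minimal Python sketch. It assumes the TSV format described in the SentencePiece normalization documentation (source and target sequences separated by a tab, each written as space-separated hexadecimal code points) and assumes the base rules (for example the contents of nmt_nfkc.tsv) have already been read into a list of lines. The helper names `format_rule` and `add_soft_hyphen_rule` are hypothetical and are not part of SentencePiece or this repository.

```python
SOFT_HYPHEN = 0x00AD   # U+00AD SOFT HYPHEN
HYPHEN_MINUS = 0x002D  # U+002D HYPHEN-MINUS


def format_rule(source, target):
    """Format one rule line: hex code points joined by spaces, columns joined by a tab."""
    src = " ".join(f"{cp:X}" for cp in source)
    tgt = " ".join(f"{cp:X}" for cp in target)
    return f"{src}\t{tgt}"


def add_soft_hyphen_rule(base_rules):
    """Return the base rules plus a soft hyphen -> hyphen-minus rule.

    Any existing rule whose source is exactly U+00AD is dropped first,
    so the table does not end up with two rules for the same source.
    """
    soft_hyphen_src = f"{SOFT_HYPHEN:X}"
    kept = [
        line
        for line in base_rules
        if line.split("\t")[0].strip().upper() != soft_hyphen_src
    ]
    kept.append(format_rule([SOFT_HYPHEN], [HYPHEN_MINUS]))
    return kept
```

The resulting lines would be written to a TSV file and passed to training. Since a custom table is used instead of a named rule, the sketch extends the base rules rather than writing a one-line file, so that the existing normalization steps are kept.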
There is also an existing script here: https://gist.github.com/jelmervdl/712ba7a4ed663ce62d43e6f902a7254e#file-update-py
In the pipeline, the spm_train call will need updating: https://github.com/search?q=repo%3Amozilla%2Ffirefox-translations-training%20spm_train&type=code
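For orientation, the change to the spm_train invocation might look roughly like the sketch below. It only assembles the argument list. The `--normalization_rule_tsv` flag is the one described in the SentencePiece normalization documentation. The function name `build_spm_train_args`, the file paths, and the vocabulary size are placeholders, and the flags the pipeline actually passes today are not shown here.

```python
def build_spm_train_args(corpus_path, model_prefix, vocab_size, rules_tsv_path):
    """Assemble an spm_train command line that uses a custom normalization table.

    All argument values are hypothetical placeholders; the real pipeline
    passes additional flags that are not reproduced in this sketch.
    """
    return [
        "spm_train",
        f"--input={corpus_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # Point at the custom table instead of relying on a named built-in rule.
        f"--normalization_rule_tsv={rules_tsv_path}",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, since spm_train may not be installed.
    print(" ".join(build_spm_train_args("corpus.txt", "vocab", 32000, "custom_rules.tsv")))
```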