Handle soft hyphens with custom normalization tables

Open eu9ene opened this issue 3 years ago • 4 comments

Ulrich:

The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps.

Whether we actually need this still has to be clarified.

eu9ene, Oct 30 '21 00:10

Our current setup uses "nmt_nfkc" for normalization. It is built on NFKC, which is "Compatibility Decomposition, followed by Canonical Composition". That looks like a good normalization strategy for making sure equivalent Unicode sequences end up with the same byte representation.
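To illustrate why soft hyphens need separate handling, here is a minimal sketch using plain NFKC from Python's standard `unicodedata` module. This is not SentencePiece's "nmt_nfkc" table itself, only the Unicode normalization form it is based on, and the example word is made up.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def nfkc(text: str) -> str:
    """Apply plain Unicode NFKC (stdlib), not SentencePiece's nmt_nfkc."""
    return unicodedata.normalize("NFKC", text)


# A made-up word containing an invisible soft hyphen.
word = "trans" + SOFT_HYPHEN + "lation"
normalized = nfkc(word)

# U+00AD has no decomposition mapping, so NFKC leaves it in place.
soft_hyphen_survives = SOFT_HYPHEN in normalized

# Compatibility characters, by contrast, are folded:
# U+FB01 LATIN SMALL LIGATURE FI becomes the two letters "fi".
ligature_folded = nfkc("\ufb01") == "fi"

print(soft_hyphen_survives, ligature_folded)
```

So two strings that look identical on screen can still differ by a U+00AD after NFKC, which is the case a custom rule would cover.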

The work here, which is also listed in #69, would be to handle soft hyphens (U+00AD) by canonicalizing them as hyphen-minus (U+002D).

See also: https://github.com/browsermt/bergamot-translator/issues/337

This custom table can be built: https://github.com/google/sentencepiece/blob/master/doc/normalization.md#use-custom-normalization-rule-1

There is also an existing script here: https://gist.github.com/jelmervdl/712ba7a4ed663ce62d43e6f902a7254e#file-update-py
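As a rough illustration of what an entry in such a custom table might look like, the sketch below formats one rule as a TSV line. It assumes the format described in the SentencePiece normalization doc linked above (source and target as space-separated hexadecimal code points, split by a tab). The helper names are hypothetical and this is not the linked gist's code. Both variants discussed in this issue are shown: dropping the soft hyphen and mapping it to hyphen-minus.

```python
def to_hex(text: str) -> str:
    """Render each code point of `text` as uppercase hex, space-separated."""
    return " ".join(format(ord(ch), "X") for ch in text)


def rule_line(source: str, target: str) -> str:
    """Format one rule as '<source code points>TAB<target code points>'.

    Assumed TSV layout; an empty target means the source is removed.
    """
    return f"{to_hex(source)}\t{to_hex(target)}"


SOFT_HYPHEN = "\u00ad"

# Variant 1: remove soft hyphens entirely (empty target column).
remove_rule = rule_line(SOFT_HYPHEN, "")

# Variant 2: canonicalize soft hyphens as hyphen-minus (U+002D).
replace_rule = rule_line(SOFT_HYPHEN, "-")

print(repr(remove_rule))
print(repr(replace_rule))
```

Only one of the two rules would go into the table. As far as I understand it, a user-supplied table takes the place of the named rule, so the line would be appended to a copy of the existing nmt_nfkc rules rather than used as a one-line file.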

The spm_train call in the pipeline will need updating: https://github.com/search?q=repo%3Amozilla%2Ffirefox-translations-training%20spm_train&type=code
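A minimal sketch of what the updated invocation could look like, assuming the custom table is passed through SentencePiece's `--normalization_rule_tsv` flag. The function name, file paths, and vocabulary size are placeholders, and the pipeline's real arguments are not reproduced here. The command is only assembled, not executed.

```python
from typing import List


def spm_train_command(corpus: str, model_prefix: str,
                      vocab_size: int, rule_tsv: str) -> List[str]:
    """Build an spm_train argument list that uses a custom normalization table.

    Hypothetical helper: the pipeline's actual call will carry more options.
    """
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # Custom table; assumed to be used in place of the named "nmt_nfkc" rule.
        f"--normalization_rule_tsv={rule_tsv}",
    ]


# Placeholder paths and size, for illustration only.
command = spm_train_command(
    corpus="corpus.txt",
    model_prefix="vocab",
    vocab_size=32000,
    rule_tsv="nmt_nfkc_soft_hyphen.tsv",
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once spm_train is on the PATH.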

gregtatum, Apr 09 '24 21:04