Input normalization
We need to be consistent about how we normalize input between training (currently some Perl scripts) and inference (here).
The students repo currently normalizes with:
- https://github.com/browsermt/students/blob/master/train-student/clean/tools/remove-non-printing-char.perl
- https://github.com/browsermt/students/blob/master/train-student/clean/tools/normalize-punctuation.perl
There's some controversy around punctuation normalization; let's discuss that over at https://github.com/ZJaume/clean/issues/1. It's quite simply wrong for several languages, since quotation-mark conventions differ per language: https://en.wikipedia.org/wiki/Quotation_mark#Summary_table
Removing non-printing characters can be done without any additional library: decode UTF-8 (e.g. with https://github.com/google/sentencepiece/blob/bc53923a9147dc8ffa54034c8ed774de78cc4d39/src/util.cc#L42) and match each code point against the ranges for the Unicode categories Cc, Cf, Cs, and Co from https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
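A minimal self-contained C++ sketch of that approach. The ranges below are only a few illustrative entries from Cc/Cf/Cs/Co; a real implementation would generate the full table from DerivedGeneralCategory.txt, and the decoder here is a simple stand-in for the SentencePiece one linked above. Matched code points are replaced with a space, mirroring the Moses-style Perl script:

```cpp
#include <cstdint>
#include <string>

namespace {

struct Range { uint32_t lo, hi; };

// Illustrative subset only; generate the full Cc/Cf/Cs/Co table
// from DerivedGeneralCategory.txt in a real implementation.
const Range kNonPrinting[] = {
    {0x0000, 0x0008}, {0x000B, 0x000C}, {0x000E, 0x001F},  // Cc, minus \t \n \r
    {0x007F, 0x009F},                                      // Cc: DEL + C1 controls
    {0x00AD, 0x00AD}, {0x200B, 0x200F}, {0x202A, 0x202E},  // Cf examples
    {0xFEFF, 0xFEFF},                                      // Cf: BOM / zero-width no-break space
    {0xD800, 0xDFFF},                                      // Cs: surrogates
    {0xE000, 0xF8FF},                                      // Co: private use (BMP part)
};

bool IsNonPrinting(uint32_t cp) {
  for (const Range& r : kNonPrinting)
    if (cp >= r.lo && cp <= r.hi) return true;
  return false;
}

// Decode one UTF-8 code point starting at s[i]; advances i.
// Malformed bytes yield U+FFFD and advance by one byte.
uint32_t DecodeOne(const std::string& s, size_t& i) {
  unsigned char c = s[i];
  if (c < 0x80) { ++i; return c; }
  size_t len;
  if ((c & 0xE0) == 0xC0) len = 2;
  else if ((c & 0xF0) == 0xE0) len = 3;
  else if ((c & 0xF8) == 0xF0) len = 4;
  else { ++i; return 0xFFFD; }  // stray continuation or invalid lead byte
  uint32_t cp = c & (0x7F >> len);
  size_t j = i + 1;
  for (; j < i + len; ++j) {
    if (j >= s.size() || (s[j] & 0xC0) != 0x80) { ++i; return 0xFFFD; }
    cp = (cp << 6) | (s[j] & 0x3F);
  }
  i = j;
  return cp;
}

}  // namespace

// Replace non-printing code points with a space, as the Perl script's
// s/\p{C}/ /g does (restricted here to the illustrative ranges above).
std::string RemoveNonPrinting(const std::string& in) {
  std::string out;
  size_t i = 0;
  while (i < in.size()) {
    size_t start = i;
    uint32_t cp = DecodeOne(in, i);
    if (IsNonPrinting(cp)) out += ' ';
    else out.append(in, start, i - start);
  }
  return out;
}
```

For example, `RemoveNonPrinting("a\xE2\x80\x8B" "b")` (an interior U+200B ZERO WIDTH SPACE) yields `"a b"`.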
Wherever we end up on normalization, it should be implemented in C++ and then used both here and in parallel corpus preparation.
Actually, this may not be a bergamot-translator issue at all. We could instead build the normalization rules into the SentencePiece vocab, as in https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece
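If we go that route, SentencePiece can compile custom normalization rules into the model at training time via `--normalization_rule_tsv`; the rules then apply automatically at encode time, so training and runtime can't drift apart. A hedged sketch (the file names and TSV contents here are hypothetical; check the SentencePiece normalization docs for the exact TSV format):

```shell
# custom_rules.tsv maps source code points to replacements, one rule per
# line: space-separated hex code points, a tab, then the target code
# points (empty target deletes). E.g. a hypothetical rule dropping U+200B.
spm_train --input=corpus.txt \
          --model_prefix=vocab \
          --vocab_size=32000 \
          --normalization_rule_tsv=custom_rules.tsv
```

The trade-off: rules baked into the vocab are applied identically everywhere the model is loaded, but changing them requires retraining the vocab.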