Input normalization
We need to be consistent about how we normalize input between training (currently some Perl scripts) and inference (here).
The students repo currently normalizes with:
- https://github.com/browsermt/students/blob/master/train-student/clean/tools/remove-non-printing-char.perl
- https://github.com/browsermt/students/blob/master/train-student/clean/tools/normalize-punctuation.perl
There's some controversy around punctuation normalization; let's discuss that over at https://github.com/ZJaume/clean/issues/1. It's quite simply wrong for several languages, since quotation-mark conventions differ per language: https://en.wikipedia.org/wiki/Quotation_mark#Summary_table
Removing non-printing characters can be done without any additional library: decode UTF-8 (e.g. with https://github.com/google/sentencepiece/blob/bc53923a9147dc8ffa54034c8ed774de78cc4d39/src/util.cc#L42) and match each code point against the ranges for the Unicode categories Cc, Cf, Cs, and Co from https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
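A minimal self-contained C++ sketch of that approach. The ranges below are only a few illustrative entries from Cc/Cf/Cs/Co; a real implementation would generate the full table from DerivedGeneralCategory.txt, and the decoder here is a simple stand-in for the SentencePiece one linked above. Matched code points are replaced with a space, mirroring the Moses-style Perl script:

```cpp
#include <cstdint>
#include <string>

namespace {

struct Range { uint32_t lo, hi; };

// Illustrative subset only; generate the full Cc/Cf/Cs/Co table
// from DerivedGeneralCategory.txt in a real implementation.
const Range kNonPrinting[] = {
    {0x0000, 0x0008}, {0x000B, 0x000C}, {0x000E, 0x001F},  // Cc, minus \t \n \r
    {0x007F, 0x009F},                                      // Cc: DEL + C1 controls
    {0x00AD, 0x00AD}, {0x200B, 0x200F}, {0x202A, 0x202E},  // Cf examples
    {0xFEFF, 0xFEFF},                                      // Cf: BOM / zero-width no-break space
    {0xD800, 0xDFFF},                                      // Cs: surrogates
    {0xE000, 0xF8FF},                                      // Co: private use (BMP part)
};

bool IsNonPrinting(uint32_t cp) {
  for (const Range& r : kNonPrinting)
    if (cp >= r.lo && cp <= r.hi) return true;
  return false;
}

// Decode one UTF-8 code point starting at s[i]; advances i.
// Malformed bytes yield U+FFFD and advance by one byte.
uint32_t DecodeOne(const std::string& s, size_t& i) {
  unsigned char c = s[i];
  if (c < 0x80) { ++i; return c; }
  size_t len;
  if ((c & 0xE0) == 0xC0) len = 2;
  else if ((c & 0xF0) == 0xE0) len = 3;
  else if ((c & 0xF8) == 0xF0) len = 4;
  else { ++i; return 0xFFFD; }  // stray continuation or invalid lead byte
  uint32_t cp = c & (0x7F >> len);
  size_t j = i + 1;
  for (; j < i + len; ++j) {
    if (j >= s.size() || (s[j] & 0xC0) != 0x80) { ++i; return 0xFFFD; }
    cp = (cp << 6) | (s[j] & 0x3F);
  }
  i = j;
  return cp;
}

}  // namespace

// Replace non-printing code points with a space, as the Perl script's
// s/\p{C}/ /g does (restricted here to the illustrative ranges above).
std::string RemoveNonPrinting(const std::string& in) {
  std::string out;
  size_t i = 0;
  while (i < in.size()) {
    size_t start = i;
    uint32_t cp = DecodeOne(in, i);
    if (IsNonPrinting(cp)) out += ' ';
    else out.append(in, start, i - start);
  }
  return out;
}
```

For example, `RemoveNonPrinting("a\xE2\x80\x8B" "b")` (an interior U+200B ZERO WIDTH SPACE) yields `"a b"`.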
Wherever we end up on normalization, it should be implemented in C++ and then used both here and in parallel corpus preparation.
Actually, this may not be a bergamot-translator issue at all. We could instead build the normalization rules into the SentencePiece vocab, as in https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece
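If we go that route, SentencePiece can compile custom normalization rules into the model at training time via `--normalization_rule_tsv`; the rules then apply automatically at encode time, so training and runtime can't drift apart. A hedged sketch (the file names and TSV contents here are hypothetical; check the SentencePiece normalization docs for the exact TSV format):

```shell
# custom_rules.tsv maps source code points to replacements, one rule per
# line: space-separated hex code points, a tab, then the target code
# points (empty target deletes). E.g. a hypothetical rule dropping U+200B.
spm_train --input=corpus.txt \
          --model_prefix=vocab \
          --vocab_size=32000 \
          --normalization_rule_tsv=custom_rules.tsv
```

The trade-off: rules baked into the vocab are applied identically everywhere the model is loaded, but changing them requires retraining the vocab.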