YouTokenToMe [Feature] Add text normalisation as SentencePiece does

Open keotic opened this issue 4 years ago • 2 comments

The SentencePiece implementation includes a text normalisation stage, which is very useful for reducing the number of distinct characters, handling non-printable characters, and more. This feature may speed up run time and yield better tokenisation results.

Sep 23 '19 08:09 keotic

Is it common for datasets to contain a significant percentage of non-printable characters? By significant I mean more than 0.1%. Otherwise, they can easily be filtered out with the coverage option.

Sep 23 '19 13:09 xbelonogov
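One rough way to check a dataset against the 0.1% threshold mentioned above is to count the characters that fall into Unicode's "Other" general categories. The sketch below uses only the Python standard library; the function name `non_printable_fraction` and the choice to treat tab, newline and carriage return as ordinary whitespace are assumptions made for illustration, not part of YouTokenToMe or SentencePiece.

```python
import unicodedata

# Assumption: these control characters count as ordinary whitespace,
# not as "non-printable" noise.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def non_printable_fraction(text: str) -> float:
    """Return the share of characters in `text` considered non-printable.

    A character is counted when its Unicode general category starts
    with "C" (control, format, surrogate, private use, unassigned)
    and it is not one of the allowed whitespace controls.
    """
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch not in ALLOWED_CONTROLS
        and unicodedata.category(ch).startswith("C")
    )
    return bad / len(text)


if __name__ == "__main__":
    sample = "ab\x00d"  # one NUL byte among four characters
    print(f"{non_printable_fraction(sample):.2%}")
```

If the reported share stays well below 0.1%, filtering rare characters through the coverage option is probably enough; a higher share suggests cleaning the text before training.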

It's more than just non-printable characters. Looking at SentencePiece's nmt_nfkc.tsv normalisation file, there are more than 220K normalisation rules, covering all kinds of UTF and Asian-language issues. Without normalisation, problems are very common when working with informal-language datasets such as UGC or Twitter.

Sep 23 '19 13:09 keotic
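Until such a feature exists, a similar effect can be approximated by normalising the text before it is passed to the tokeniser. The following is a minimal sketch using only Python's standard `unicodedata` module. It applies plain NFKC, drops control and format characters, and collapses whitespace. It does not reproduce SentencePiece's nmt_nfkc rule set, and the function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Approximate pre-tokenisation clean-up (not SentencePiece's exact rules)."""
    # NFKC folds compatibility variants (full-width letters, ligatures,
    # no-break spaces, ...) into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)

    cleaned = []
    for ch in text:
        if ch in "\t\n\r":
            # Treat common whitespace controls as a plain space.
            cleaned.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop remaining control/format/unassigned characters.
            continue
        else:
            cleaned.append(ch)

    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(cleaned).split())


if __name__ == "__main__":
    noisy = "\uff28\uff45\uff4c\uff4c\uff4f\u200b  \uff57\uff4f\uff52\uff4c\uff44"
    print(normalize_text(noisy))
```

Running this over the training corpus and over every input at inference time keeps the two consistent, which matters because the tokeniser only sees the normalised form.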
