
[Feature] Add text normalisation as SentencePiece does

Open keotic opened this issue 4 years ago • 2 comments

The SentencePiece implementation includes a text normalisation stage, which is very useful for reducing the number of distinct characters, handling non-printable characters, and more. Adding this feature may speed up run time and yield better tokenisation results.

keotic, Sep 23 '19 08:09

Is it common for datasets to contain a significant percentage of non-printable characters? By significant I mean more than 0.1%. Otherwise, they can easily be filtered out with the coverage option.
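One way to check a dataset against that 0.1% threshold is to count the characters directly. This is a minimal sketch using only the Python standard library; the function name `non_printable_fraction` is hypothetical, and treating every Unicode general category starting with "C" (control, format, private use, unassigned) as non-printable is an assumption, not a definition taken from YouTokenToMe.

```python
import unicodedata


def non_printable_fraction(lines):
    """Return the share of characters whose Unicode category starts with "C".

    Assumption: "non-printable" means control (Cc), format (Cf), surrogate (Cs),
    private-use (Co), or unassigned (Cn) characters. Note that tabs are Cc too.
    """
    total = 0
    bad = 0
    for line in lines:
        # Ignore the trailing newline so it is not counted as a control character.
        for ch in line.rstrip("\n"):
            total += 1
            if unicodedata.category(ch).startswith("C"):
                bad += 1
    return bad / total if total else 0.0


if __name__ == "__main__":
    sample = ["plain text\n", "zero\u200bwidth space\n", "null\x00byte\n"]
    print(f"{non_printable_fraction(sample):.4%}")
```

If the reported share is well under 0.1%, lowering the coverage setting slightly may be enough; if it is higher, explicit cleaning before training is probably worth it.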

xbelonogov, Sep 23 '19 13:09

> Is it common for datasets to contain a significant percentage of non-printable characters? By significant I mean more than 0.1%. Otherwise, they can easily be filtered out with the coverage option.

It's more than just non-printable characters. SentencePiece's nmt_nfkc.tsv normalisation file contains more than 220K normalisation rules, covering all kinds of Unicode and Asian-language issues. Without normalisation, problems are very common when working on informal-language datasets such as user-generated content (UGC) or Twitter.
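As a rough illustration of the kind of pre-processing meant here, the sketch below applies plain NFKC normalisation from Python's standard library before the text is handed to the tokeniser. It is only an approximation: it does not load or reproduce the nmt_nfkc.tsv rules, and the function name `normalise` and the choice to strip control and format characters are assumptions made for the example.

```python
import unicodedata


def normalise(text):
    """Approximate a normalisation stage with stdlib NFKC (not nmt_nfkc.tsv)."""
    # NFKC folds compatibility variants (full-width letters, ligatures, ...)
    # into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters, but keep tabs and newlines so that
    # they can still act as whitespace in the next step.
    text = "".join(
        ch
        for ch in text
        if ch in "\t\n" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalise("Ｈｅｌｌｏ\u200b  ｗｏｒｌｄ"))  # full-width letters, zero-width space
    print(normalise("ﬁne"))  # "fi" ligature
```

Running the training corpus through a function like this first reduces the number of distinct characters the BPE model has to cover, which is the effect the feature request is after.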

keotic, Sep 23 '19 13:09