nlp-discussion
Existing work: Text normalization
Assuming this also includes text pre-processing:
Unicode normalization
- https://github.com/unicode-rs/unicode-normalization : Unicode Normalization forms according to UAX#15 rules
- Possibly https://github.com/kornelski/deunicode/ : Convert Unicode to ASCII
Case folding
- str::to_ascii_lowercase: lowercases ASCII characters only; fast, and can be done in place (via str::make_ascii_lowercase).
- str::to_lowercase: Unicode-aware lowercasing; can change the length of the string (some characters expand into multiple characters when changing case), so it cannot be done in place, and it is relatively slow.
- Some intermediate solution between the above two, as discussed in https://github.com/rust-lang/rust/issues/26244#issuecomment-344525748 .
Related projects:
- https://github.com/JuliaStrings/utf8proc
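To make the trade-off between the two standard-library options concrete, here is a minimal sketch using only `std`. The function names are hypothetical, and `lowercase_with_ascii_fast_path` is just one conceivable intermediate approach for illustration, not necessarily what the linked issue proposes.

```rust
/// Lowercases ASCII letters in place; non-ASCII characters are left untouched.
fn lower_ascii_in_place(s: &mut String) {
    s.make_ascii_lowercase();
}

/// Unicode-aware lowercasing; always allocates a new String, and the
/// result may have a different length than the input.
fn lower_unicode(s: &str) -> String {
    s.to_lowercase()
}

/// Hypothetical middle ground: take the cheap in-place path when the
/// input is pure ASCII, and fall back to the Unicode-aware path otherwise.
fn lowercase_with_ascii_fast_path(s: &mut String) {
    if s.is_ascii() {
        s.make_ascii_lowercase();
    } else {
        *s = s.to_lowercase();
    }
}
```

For example, `lower_unicode("İ")` turns the single character U+0130 into two characters (`i` followed by the combining dot U+0307), which is why the Unicode-aware version cannot work in place.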
In conllx-utils we have a utility (conllx-cleanup) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation characters to ASCII:
- https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs
- https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs
This helps particularly if the training corpora for a model do not contain such non-ASCII punctuation characters (e.g. the German treebank that we use was originally ISO-8859-15), though the impact is smaller when word embeddings are used.
This is a niche utility, but it shows another type of normalization that would be useful to have in a general normalization crate.
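As a rough illustration of this kind of punctuation rewriting, here is a minimal sketch using only `std`. The mapping table and the function names are assumptions made up for this example and are not taken from conllx-utils; the Unicode normalization step that would run first is left out, since it would need an external crate.

```rust
/// Returns an ASCII replacement for a few common non-ASCII punctuation
/// characters. The table is illustrative only, not an exhaustive mapping.
fn ascii_punct(c: char) -> Option<&'static str> {
    match c {
        // Single quotation marks
        '\u{2018}' | '\u{2019}' => Some("'"),
        // Double quotation marks, including the low-9 mark used in German
        '\u{201C}' | '\u{201D}' | '\u{201E}' => Some("\""),
        // En dash and em dash
        '\u{2013}' | '\u{2014}' => Some("-"),
        // Horizontal ellipsis
        '\u{2026}' => Some("..."),
        _ => None,
    }
}

/// Rewrites the punctuation covered by `ascii_punct`; every other
/// character (including non-ASCII letters) is copied unchanged.
fn rewrite_punctuation(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match ascii_punct(c) {
            Some(replacement) => out.push_str(replacement),
            None => out.push(c),
        }
    }
    out
}
```

Keeping the table in a single `match` makes it easy to extend, and letters such as `ü` or `ß` pass through untouched, so only the punctuation is affected.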
If this includes text preprocessing, there's also https://github.com/Matthew-Maclean/english-numbers/ , which I use, although it seems to be no longer actively developed and doesn't support ordinals. I generally need to replace things like numbers and symbols (e.g. $) with their textual equivalents, and I have a feeling a lot of that work is yet to be done in the Rust NLP space.
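To show what the number-to-text part of such preprocessing involves, here is a minimal sketch of spelling out cardinal numbers in English using only `std`. It is not the english-numbers API; the names `below_thousand` and `cardinal` are hypothetical, and ordinals and symbols such as `$` are not handled.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
];

/// Spells out a number in the range 0..=999.
fn below_thousand(n: u32) -> String {
    let mut parts: Vec<String> = Vec::new();
    if n >= 100 {
        parts.push(format!("{} hundred", ONES[(n / 100) as usize]));
    }
    let rest = n % 100;
    if rest >= 20 {
        let tens = TENS[(rest / 10) as usize];
        if rest % 10 == 0 {
            parts.push(tens.to_string());
        } else {
            parts.push(format!("{}-{}", tens, ONES[(rest % 10) as usize]));
        }
    } else if rest > 0 || n == 0 {
        // "zero" is only emitted when the whole number is zero.
        parts.push(ONES[rest as usize].to_string());
    }
    parts.join(" ")
}

/// Spells out any u32 as an English cardinal number (no "and", no ordinals).
fn cardinal(n: u32) -> String {
    if n < 1000 {
        return below_thousand(n);
    }
    let scales: [(u32, &str); 3] = [
        (1_000_000_000, "billion"),
        (1_000_000, "million"),
        (1_000, "thousand"),
    ];
    let mut parts: Vec<String> = Vec::new();
    let mut rest = n;
    for &(value, name) in scales.iter() {
        if rest >= value {
            parts.push(format!("{} {}", below_thousand(rest / value), name));
            rest %= value;
        }
    }
    if rest > 0 {
        parts.push(below_thousand(rest));
    }
    parts.join(" ")
}
```

With these definitions, `cardinal(42)` gives `forty-two` and `cardinal(1015)` gives `one thousand fifteen`. A real normalizer would also need to find the numbers in running text and decide between cardinal, ordinal, year, and currency readings.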
For FST-based text normalization, I re-implemented the C++ library OpenFST entirely in Rust: https://github.com/Garvys/rustfst (and it performs better than OpenFST 😅). It might prove useful.