nlp-discussion Existing work: Text normalization

May 08 '19 08:05 danieldk

Assuming this also includes text pre-processing:

Unicode normalization

https://github.com/unicode-rs/unicode-normalization : Unicode Normalization forms according to UAX#15 rules
Possibly https://github.com/kornelski/deunicode/ : Convert Unicode to ASCII

Case folding

str::to_ascii_lowercase : lowercases ASCII characters only; fast, and can be done in place (via str::make_ascii_lowercase).
str::to_lowercase : Unicode-aware lowercasing; it can change the length of the string (some characters expand into multiple characters when their case changes), so it cannot be done in place, and it is relatively slow.
Some intermediate solution between the above two, as discussed in https://github.com/rust-lang/rust/issues/26244#issuecomment-344525748 . Related projects:
- https://github.com/JuliaStrings/utf8proc
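
To make the trade-off concrete, here is a minimal std-only sketch of the three options. The function names are made up for illustration, and `lower_fast_path` is only one possible reading of an "intermediate solution" (an ASCII fast path with a Unicode fallback), not what the linked issue settles on.

```rust
/// Lowercase ASCII letters only, returning a new String.
/// Non-ASCII characters such as 'Ä' are left untouched.
fn ascii_lower(s: &str) -> String {
    s.to_ascii_lowercase()
}

/// Lowercase ASCII letters in place. This works because the
/// byte length of the string can never change.
fn ascii_lower_in_place(s: &mut String) {
    s.make_ascii_lowercase();
}

/// Hypothetical middle ground: take the cheap ASCII path when the
/// whole input is ASCII, otherwise fall back to full Unicode lowercasing.
fn lower_fast_path(s: &str) -> String {
    if s.is_ascii() {
        s.to_ascii_lowercase()
    } else {
        s.to_lowercase()
    }
}
```

Note that the Unicode path can grow the string: lowercasing 'İ' (U+0130) yields two code points, which is why it has to allocate.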

May 08 '19 09:05 rth

In conllx-utils we have a utility (conllx-cleanup) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation characters to ASCII:

https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs
https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs

This helps particularly if the training corpora for a model do not contain such non-ASCII punctuation characters (e.g. the German treebank that we use was originally ISO-8859-15), though the impact is smaller when word embeddings are used.

This is a niche utility, but it shows another type of normalization that would be useful to have in a general normalization crate.
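
As a rough illustration of this kind of rewriting, here is a std-only sketch. The mapping table is invented for the example and is not the one conllx-cleanup uses, and the function names are hypothetical. It assumes Unicode normalization has already been applied in a separate step.

```rust
/// Map a handful of non-ASCII punctuation characters to ASCII.
/// Illustrative table only; a real one would cover many more characters.
fn ascii_punct(c: char) -> Option<&'static str> {
    match c {
        // Single quotation marks, including the low-9 form used in German.
        '\u{2018}' | '\u{2019}' | '\u{201A}' => Some("'"),
        // Double quotation marks, including the low-9 form.
        '\u{201C}' | '\u{201D}' | '\u{201E}' => Some("\""),
        // En dash and em dash.
        '\u{2013}' | '\u{2014}' => Some("-"),
        // Horizontal ellipsis.
        '\u{2026}' => Some("..."),
        _ => None,
    }
}

/// Rewrite the punctuation covered by `ascii_punct`, leaving every
/// other character (including non-ASCII letters) as it is.
fn rewrite_punctuation(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match ascii_punct(c) {
            Some(replacement) => out.push_str(replacement),
            None => out.push(c),
        }
    }
    out
}
```

Letters such as ü or ß pass through unchanged, so only the punctuation the training corpus never saw gets rewritten.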

May 08 '19 10:05 danieldk

If this includes text preprocessing, there's also https://github.com/Matthew-Maclean/english-numbers/ , which I use, although it seems to be no longer actively developed and doesn't support ordinals. I generally need to replace things like numbers and symbols (e.g. $) with their textual equivalents, and I have a feeling a lot of that work is yet to be done in the Rust NLP space.
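
For a sense of what that replacement step involves, here is a small std-only sketch. It does not use the english-numbers API; all names are hypothetical, the symbol table is just two entries, and it only handles cardinals below one million on whitespace-separated tokens.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
];

/// Spell out a number below 1000 (the caller must guarantee the bound).
fn below_thousand(n: u32) -> String {
    match n {
        0..=19 => ONES[n as usize].to_string(),
        20..=99 => {
            let tens = TENS[(n / 10) as usize];
            if n % 10 == 0 {
                tens.to_string()
            } else {
                format!("{}-{}", tens, ONES[(n % 10) as usize])
            }
        }
        _ => {
            let head = format!("{} hundred", ONES[(n / 100) as usize]);
            if n % 100 == 0 {
                head
            } else {
                format!("{} {}", head, below_thousand(n % 100))
            }
        }
    }
}

/// Spell out a cardinal below one million; larger values return None.
fn number_to_words(n: u32) -> Option<String> {
    match n {
        0..=999 => Some(below_thousand(n)),
        1_000..=999_999 => {
            let head = format!("{} thousand", below_thousand(n / 1000));
            if n % 1000 == 0 {
                Some(head)
            } else {
                Some(format!("{} {}", head, below_thousand(n % 1000)))
            }
        }
        _ => None,
    }
}

/// Replace all-digit tokens and a couple of symbols with words.
/// Tokens are split on whitespace and re-joined with single spaces.
fn verbalize(text: &str) -> String {
    text.split_whitespace()
        .map(|tok| match tok {
            "$" => "dollars".to_string(),
            "%" => "percent".to_string(),
            _ if tok.bytes().all(|b| b.is_ascii_digit()) => tok
                .parse::<u32>()
                .ok()
                .and_then(number_to_words)
                .unwrap_or_else(|| tok.to_string()),
            _ => tok.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Ordinals, decimals, currency attached to the number ("$42") and locale-specific wording are all left out, which is roughly the gap described above.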

Nov 25 '19 09:11 xd009642

For FST-based text normalization, I re-implemented the C++ library OpenFST fully in Rust: https://github.com/Garvys/rustfst (and it has better performance than OpenFST 😅). It might prove useful.

Feb 05 '20 13:02 Garvys
