
Existing work: Text normalization

Open danieldk opened this issue 6 years ago • 4 comments

danieldk avatar May 08 '19 08:05 danieldk

Assuming this also includes text pre-processing:

Unicode normalization

  • https://github.com/unicode-rs/unicode-normalization : Unicode Normalization forms according to UAX#15 rules
  • Possibly https://github.com/kornelski/deunicode/ : Convert Unicode to ASCII

Case folding

  • str::to_ascii_lowercase: lowercases ASCII characters only; fast. An in-place variant exists (str::make_ascii_lowercase).
  • str::to_lowercase: Unicode-aware conversion to lowercase. It can change the length of the string (some characters expand into multiple characters when the case changes), so it cannot be done in place, and it is relatively slow.
  • Some intermediate solution between the two, as discussed in https://github.com/rust-lang/rust/issues/26244#issuecomment-344525748 . Related projects:
    • https://github.com/JuliaStrings/utf8proc
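To make the trade-off between the two standard-library options concrete, here is a minimal sketch using only `std`. The wrapper function names are hypothetical and exist only for illustration; note that `to_lowercase` is lowercasing, not full Unicode case folding.

```rust
/// Lowercase only the ASCII letters, returning a new String.
/// Non-ASCII characters are left untouched.
fn fold_ascii(s: &str) -> String {
    s.to_ascii_lowercase()
}

/// Lowercase the ASCII letters in place, without allocating.
/// This is possible because ASCII lowercasing never changes byte length.
fn fold_ascii_in_place(s: &mut String) {
    s.make_ascii_lowercase();
}

/// Unicode-aware lowercasing. The result can have a different byte
/// length than the input, so a new String is always allocated.
fn fold_unicode(s: &str) -> String {
    s.to_lowercase()
}
```

For example, `fold_ascii("Grüße ÄB")` leaves the `Ä` uppercase, while `fold_unicode("İ")` returns `i` followed by a combining dot (U+0307), which is one byte longer than the input.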

rth avatar May 08 '19 09:05 rth

In conllx-utils we have a utility (conllx-cleanup) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation signs to ASCII:

https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs

This helps particularly if the training corpora for a model do not contain such non-ASCII punctuation characters (e.g. the German treebank that we use was originally ISO-8859-15), though the impact is smaller when word embeddings are used.

This is a niche utility, but it shows another type of normalization that would be useful to have in a general normalization crate.
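As a rough illustration of the punctuation-rewriting step only (not the Unicode normalization that precedes it), a sketch might look like the following. The function name and the character mapping are assumptions chosen for the example; they are not the table that conllx-cleanup actually uses.

```rust
/// Rewrite a few common non-ASCII punctuation characters to ASCII.
/// The mapping is illustrative only, not the one used by conllx-cleanup.
fn ascii_punctuation(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            // Single quotation marks, including the low-9 variant.
            '\u{2018}' | '\u{2019}' | '\u{201A}' => out.push('\''),
            // Double quotation marks, including the German low-9 opener.
            '\u{201C}' | '\u{201D}' | '\u{201E}' => out.push('"'),
            // En dash and em dash.
            '\u{2013}' | '\u{2014}' => out.push('-'),
            // Horizontal ellipsis expands to three ASCII dots.
            '\u{2026}' => out.push_str("..."),
            // Everything else, including non-ASCII letters, is kept.
            other => out.push(other),
        }
    }
    out
}
```

Letters such as `ü` or `ß` pass through unchanged; only the listed punctuation is rewritten.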

danieldk avatar May 08 '19 10:05 danieldk

If this includes text preprocessing, there's also https://github.com/Matthew-Maclean/english-numbers/ which I use, although it seems to be no longer actively developed and doesn't support ordinals. I generally need to replace things like numbers and symbols (e.g. $) with their textual equivalents, and I have a feeling a lot of that work is yet to be done in the Rust NLP space.
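To show the kind of replacement meant here, below is a deliberately small sketch using only `std`. It does not use the english-numbers crate; the function names, the 0 to 99 range, and the symbol table are all assumptions made for the example.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
];

/// Spell out a number in 0..=99; larger values return None.
fn small_number_to_words(n: u8) -> Option<String> {
    match n {
        0..=19 => Some(ONES[n as usize].to_string()),
        20..=99 => {
            let tens = TENS[(n / 10) as usize];
            let ones = n % 10;
            if ones == 0 {
                Some(tens.to_string())
            } else {
                Some(format!("{}-{}", tens, ONES[ones as usize]))
            }
        }
        _ => None,
    }
}

/// Replace whitespace-separated tokens that are small numbers or one of
/// a few known symbols; all other tokens are kept as they are.
fn verbalize(text: &str) -> String {
    text.split_whitespace()
        .map(|tok| match tok {
            "$" => "dollar".to_string(),
            "%" => "percent".to_string(),
            "&" => "and".to_string(),
            _ => tok
                .parse::<u8>()
                .ok()
                .and_then(small_number_to_words)
                .unwrap_or_else(|| tok.to_string()),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

A real normalizer would also need ordinals, larger numbers, and tokens where the symbol is attached to the number (such as `$5`), none of which this sketch handles.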

xd009642 avatar Nov 25 '19 09:11 xd009642

For FST-based text normalization, I re-implemented the C++ library OpenFST entirely in Rust: https://github.com/Garvys/rustfst (and it performs better than OpenFST 😅). Might prove useful.

Garvys avatar Feb 05 '20 13:02 Garvys