Add normalization option to Chinese characters (using OpenCC) and separate symbols from merging

Open ecchochan opened this issue 5 years ago • 0 comments

I was working on two things and would like to add them to this library in case somebody needs them:

  1. normalize Chinese characters
  2. keep symbols from merging.

Features implemented:

  1. Add a normalization option to normalizers::BertNormalizer - norm_options

    • keep digits 0-9 from merging - SEPARATE_INTEGERS e.g. 100 → 1 0 0
    • keep symbols from merging - SEPARATE_SYMBOLS e.g. Mr. Stark → Mr . Stark
    • convert Simplified to Traditional characters - SIMPL_TO_TRAD e.g.
    • convert Traditional to Simplified characters - TRAD_TO_SIMPL e.g.
    • hand-crafted character mapping - ZH_NORM_MAPPING e.g. [
  2. Check whether OpenCC is installed and can be used - normalizers::opencc_enabled

  3. Test cases

    • Separate Symbols
    • Simpl. to Trad. characters
  4. Expose the above features to Python
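
To make the two separation options concrete, here is a minimal pure-Python sketch of the intended effect. It is not the Rust code from this change: the helper names `separate_integers` and `separate_symbols` are hypothetical, "symbol" is assumed to mean any character in a Unicode punctuation (P*) or symbol (S*) category, and runs of whitespace are collapsed for readability, which the real normalizer may not do.

```python
import unicodedata


def _isolate(text, should_isolate):
    """Surround every matching character with spaces, then collapse whitespace."""
    pieces = []
    for ch in text:
        pieces.append(f" {ch} " if should_isolate(ch) else ch)
    # split() with no arguments drops the duplicate and edge spaces added above.
    return " ".join("".join(pieces).split())


def separate_integers(text):
    # Hypothetical counterpart of SEPARATE_INTEGERS: each ASCII digit stands alone.
    return _isolate(text, lambda ch: ch in "0123456789")


def separate_symbols(text):
    # Hypothetical counterpart of SEPARATE_SYMBOLS (assumption: Unicode P*/S* categories).
    return _isolate(text, lambda ch: unicodedata.category(ch)[0] in ("P", "S"))


print(separate_integers("100"))
print(separate_symbols("Mr. Stark"))
```

With each digit or symbol isolated this way, a subword model trained afterwards cannot merge them into multi-character tokens, which is the point of both options.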

How to install OpenCC:

I used the following script to install OpenCC.

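# Install build dependencies (sudo per command instead of an interactive root shell)
sudo apt-get install -y build-essential pkg-config opencc cmake doxygen

# Fetch the OpenCC sources and pin a release tag
git clone https://github.com/BYVoid/OpenCC.git && cd OpenCC
git checkout ver.1.1.1  # or whatever version

# Build and install, then remove the checkout
make && sudo make install
cd .. && rm -rf OpenCC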

Usage

from tokenizers.normalizers import BertNormalizer, NORM_OPTIONS
normalizer = BertNormalizer(
  norm_options=(
    NORM_OPTIONS.ZH_NORM_MAPPING | 
    NORM_OPTIONS.SIMPL_TO_TRAD | 
    NORM_OPTIONS.SEPARATE_INTEGERS | 
    NORM_OPTIONS.SEPARATE_SYMBOLS
    # Enable the options according to your needs
  )
)
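
The `|` operator in the usage above combines options as bit flags. As a rough illustration of that pattern only, the sketch below uses Python's `enum.IntFlag`. The class name `NormOptions`, the numeric values, and the `is_enabled` helper are assumptions made for this example; they are not the definitions exposed by the bindings.

```python
from enum import IntFlag


class NormOptions(IntFlag):
    # Hypothetical stand-in for NORM_OPTIONS; the bit values are illustrative.
    SIMPL_TO_TRAD = 1 << 0
    TRAD_TO_SIMPL = 1 << 1
    SEPARATE_INTEGERS = 1 << 2
    SEPARATE_SYMBOLS = 1 << 3
    ZH_NORM_MAPPING = 1 << 4


def is_enabled(options, flag):
    """Return True when every bit of `flag` is set in `options`."""
    return (options & flag) == flag


# Combine only the options you need, as in the usage example.
selected = NormOptions.SIMPL_TO_TRAD | NormOptions.SEPARATE_SYMBOLS

print(is_enabled(selected, NormOptions.SEPARATE_SYMBOLS))
print(is_enabled(selected, NormOptions.TRAD_TO_SIMPL))
```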

Limitations

  1. I used lazy_static to load OpenCC, so that it is not loaded when no tokenizer enables it. As a consequence, if the first tokenizer created does not have OpenCC enabled, no tokenizer created afterwards will be able to use OpenCC. Awaiting better solutions 🤒
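
One possible direction, sketched in Python rather than Rust, is to defer loading until a conversion is actually requested instead of deciding when the first tokenizer is created. Everything here is hypothetical: `get_converter`, the `Normalizer` class, and the two-character tables are stand-ins for illustration and do not call OpenCC.

```python
from functools import lru_cache

load_calls = []  # records each simulated "expensive" load


@lru_cache(maxsize=None)
def get_converter(direction):
    """Build the table for `direction` on first request; later calls hit the cache."""
    load_calls.append(direction)
    # Stand-in for loading OpenCC: a tiny character table per direction.
    tables = {
        "s2t": str.maketrans({"汉": "漢", "语": "語"}),
        "t2s": str.maketrans({"漢": "汉", "語": "语"}),
    }
    return tables[direction]


class Normalizer:
    """Hypothetical normalizer; direction=None means no conversion is wanted."""

    def __init__(self, direction=None):
        self.direction = direction

    def normalize(self, text):
        if self.direction is None:
            return text  # never touches the converter, so nothing is loaded
        return text.translate(get_converter(self.direction))


plain = Normalizer()           # created first, without conversion
converting = Normalizer("s2t")  # created later, still able to convert

print(plain.normalize("汉语"))
print(converting.normalize("汉语"))
```

Because the load is tied to first use rather than to the first instance, the order in which normalizers are created no longer matters, and the table is still built at most once per direction.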

Updates (2020/10/19)

I see the checks failed because the rust-opencc dependency requires the OpenCC library to be installed.

I have made opencc an optional feature instead.

Install using:

# Rust
cargo build --features opencc

# Python
python3 setup.py install --opencc

ecchochan, Oct 19 '20 05:10