Add normalization options for Chinese characters (using OpenCC) and prevent symbols from merging
I was working on two things and would like to add them to this library in case somebody else needs them:
- normalize Chinese characters
- prevent symbols from merging
Features implemented:
- Add a normalization option, `norm_options`, to `normalizers::BertNormalizer`:
  - prevent digits 0-9 from merging: `SEPARATE_INTEGERS`, e.g. `100` → `1 0 0`
  - prevent symbols from merging: `SEPARATE_SYMBOLS`, e.g. `Mr. Stark` → `Mr . Stark` (both separation options are sketched after this list)
  - convert Simplified to Traditional characters: `SIMPL_TO_TRAD`, e.g. `头` → `頭`
  - convert Traditional to Simplified characters: `TRAD_TO_SIMPL`, e.g. `頭` → `头`
  - hand-crafted character mapping: `ZH_NORM_MAPPING`, e.g. `【` → `[`
- Check whether OpenCC is installed and can be used: `normalizers::opencc_enabled`
- Test cases:
  - separate symbols
  - Simplified to Traditional characters
- Expose the above features to Python
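For intuition, here is a rough pure-Python sketch of what the two separation options do, inferred from the examples above. The `separate` helper is hypothetical and only illustrative; the actual implementation lives in the Rust normalizer:

```python
import unicodedata

def separate(text: str, integers: bool = False, symbols: bool = False) -> str:
    """Rough sketch of SEPARATE_INTEGERS / SEPARATE_SYMBOLS behavior."""
    out = []
    for ch in text:
        is_digit = integers and ch.isdigit()
        # Unicode categories starting with P (punctuation) or S (symbol)
        is_symbol = symbols and unicodedata.category(ch).startswith(("P", "S"))
        out.append(f" {ch} " if (is_digit or is_symbol) else ch)
    return " ".join("".join(out).split())  # collapse runs of spaces

assert separate("100", integers=True) == "1 0 0"
assert separate("Mr. Stark", symbols=True) == "Mr . Stark"
```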
How to install OpenCC:
I used the following script to install OpenCC:
```bash
sudo su
apt-get install -y build-essential pkg-config opencc cmake doxygen
git clone https://github.com/BYVoid/OpenCC.git && cd OpenCC
git checkout ver.1.1.1  # or whatever version
make && make install
cd .. && rm -r OpenCC
```
Usage
```python
from tokenizers.normalizers import BertNormalizer, NORM_OPTIONS

normalizer = BertNormalizer(
    norm_options=(
        NORM_OPTIONS.ZH_NORM_MAPPING |
        NORM_OPTIONS.SIMPL_TO_TRAD |
        NORM_OPTIONS.SEPARATE_INTEGERS |
        NORM_OPTIONS.SEPARATE_SYMBOLS
        # Enable the options according to your needs
    )
)
```
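Since the OpenCC-backed options only work when OpenCC is available, it may be worth guarding them with the `opencc_enabled` check added in this PR. A minimal sketch, assuming the check is exposed to Python as a callable next to the normalizer (the exact import path and signature may differ):

```python
from tokenizers.normalizers import BertNormalizer, NORM_OPTIONS, opencc_enabled

# Always keep the pure-Rust options; only request the Simplified-to-
# Traditional conversion when OpenCC is actually usable.
options = NORM_OPTIONS.SEPARATE_INTEGERS | NORM_OPTIONS.SEPARATE_SYMBOLS
if opencc_enabled():
    options |= NORM_OPTIONS.SIMPL_TO_TRAD

normalizer = BertNormalizer(norm_options=options)
```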
Limitations
- I used `lazy_static` to load OpenCC, so that OpenCC is not loaded unnecessarily when the option is disabled. As a side effect, if the first tokenizer created does not enable OpenCC, no tokenizer created afterwards will be able to use OpenCC (a small analogue of this pitfall is sketched below). Awaiting better solutions 🤒
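A minimal, self-contained Python analogue of the pitfall; illustrative only, since the real code is Rust using `lazy_static`:

```python
# A lazily initialized module-level singleton: whichever tokenizer is
# created first fixes the state for the rest of the process.
_OPENCC = None  # None means "not initialized yet"

def _get_opencc(enable: bool) -> str:
    global _OPENCC
    if _OPENCC is None:
        # The first caller's choice is locked in for all later callers.
        _OPENCC = "loaded" if enable else "disabled"
    return _OPENCC

print(_get_opencc(False))  # "disabled": first tokenizer did not enable OpenCC
print(_get_opencc(True))   # still "disabled": later tokenizers cannot turn it on
```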
Updates (2020/10/19)
The CI checks failed because of the rust-opencc dependency, which requires the OpenCC library to be installed. I have made opencc an optional Cargo feature instead.
Install using:
```bash
# Rust
cargo build --features opencc
# Python
python3 setup.py install --opencc
```