OpenIME
Open Vocabulary Learning for Neural Chinese Pinyin IME (ACL 2019)
Dataset and code accompanying the paper Open Vocabulary Learning for Neural Chinese Pinyin IME.
Dataset
We provide two processed corpora for IME evaluation: the People’s Daily corpus (PD) and the TouchPal corpus (TP).
| Corpus | Statistic | Chinese | Pinyin |
|---|---|---|---|
| PD | MIUs | 5.04M | |
| | Word | 24.7M | 24.7M |
| | Vocab | 54.3K | 41.1K |
| | Target Vocab (train) | 2309 | - |
| | Target Vocab (dec) | 2168 | - |
| TP | MIUs | 689.6K | |
| | Word | 4.1M | 4.1M |
| | Vocab | 27.2K | 20.2K |
| | Target Vocab (train) | 2020 | - |
| | Target Vocab (dec) | 2009 | - |
File suffixes:
.ali: target
.py: source
.adddict: training set
.test2k: test set
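The source/target split above suggests line-aligned parallel files. As a minimal sketch, assuming the OpenNMT-style plain-text convention (one example per line, whitespace-separated tokens, UTF-8), a `.py` source file and its matching `.ali` target file could be read as pairs like this. The function names `pair_lines` and `read_parallel` are hypothetical helpers, not part of the released code, and the actual file layout should be checked against the downloaded corpus.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple

# One training example: (source pinyin tokens, target Chinese tokens).
Pair = Tuple[List[str], List[str]]


def pair_lines(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Iterator[Pair]:
    """Yield token pairs from two line-aligned sequences of text lines.

    Assumption: tokens are separated by whitespace, one example per line.
    """
    for line_no, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        # zip_longest pads the shorter input with None, which exposes misalignment.
        if src is None or tgt is None:
            raise ValueError(f"source and target differ in length at line {line_no}")
        yield src.split(), tgt.split()


def read_parallel(src_path: Path, tgt_path: Path) -> List[Pair]:
    """Read a source (.py) file and a target (.ali) file into a list of pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return list(pair_lines(src, tgt))
```

Raising an error on a length mismatch, rather than silently truncating, makes a misaligned download or a wrong file pairing visible immediately.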
The full corpus and pre-trained vectors can be downloaded from https://drive.google.com/drive/folders/1v6QW7ULu-iYxU5uruiuSgYGmoXOcHAeX?usp=sharing
Source Code
We also release our source code, which is modified from OpenNMT and has similar usage, to help others reproduce our results.
Reference
If you use this repo, please cite our paper:
@inproceedings{zhang2019acl-ime,
title = "{Open Vocabulary Learning for Neural Chinese Pinyin IME}",
author = "Zhang, Zhuosheng and Huang, Yafang and Zhao, Hai",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)",
year = "2019",
}